ABSTRACT

Query execution over the Web of Linked Data has attracted much 
attention recently. A particularly interesting approach is link traver- 
sal based query execution which proposes to integrate the traversal 
of data links into the creation of query results. Hence, in contrast
to traditional query execution paradigms, this approach does not assume a
fixed set of relevant data sources beforehand; instead, the traversal 
process discovers data and data sources on the fly and, thus, enables 
applications to tap the full potential of the Web. 

While several authors have studied possibilities to implement the 
idea of link traversal based query execution and to optimize query 
execution in this context, no work exists that discusses theoretical 
foundations of the approach in general. Our paper fills this gap. 

We introduce a well-defined semantics for queries that may be 
executed using a link traversal based approach. Based on this se- 
mantics we formally analyze properties of such queries. In partic- 
ular, we study the computability of queries as well as the implica- 
tions of querying a potentially infinite Web of Linked Data. Our 
results show that query computation in general is not guaranteed 
to terminate and that for any given query it is undecidable whether 
the execution terminates. Furthermore, we define an abstract exe- 
cution model that captures the integration of link traversal into the 
query execution process. Based on this model we prove the sound- 
ness and completeness of link traversal based query execution and 
analyze an existing implementation approach. 

Categories and Subject Descriptors 

H.3.3 [Information Storage and Retrieval]: Information Search 
and Retrieval; F.1.1 [Computation by Abstract Devices]: Models 
of Computation 

General Terms 

Management, Theory 

Keywords 

link traversal based query execution, query semantics, computabil- 
ity, Web of Data, Linked Data 



*This report presents an extended version of a paper published in HT
2012 [8]. The extended version contains proofs for all propositions,
lemmas, and theorems in the paper (cf. Appendix B).



1. INTRODUCTION

During recent years an increasing number of data providers adopted 
the Linked Data principles for publishing and interlinking structured
data on the World Wide Web (WWW). The Web of
Linked Data that emerges from this process enables users to benefit 
from a virtually unbounded set of data sources and, thus, opens pos- 
sibilities not conceivable before. Consequently, the Web of Linked 
Data has spawned research to execute declarative queries over mul- 
tiple Linked Data sources. Most approaches adapt techniques that 
are known from the database literature (e.g. data warehousing or 
query federation). However, the Web of Linked Data is differ- 
ent from traditional database systems; distinguishing characteris- 
tics are its unbounded nature and the lack of a database catalog. 
Due to these characteristics it is impossible to know all data sources 
that might contribute to the answer of a query. In this context, tra- 
ditional query execution paradigms are insufficient because those 
assume a fixed set of potentially relevant data sources beforehand. 
This assumption is a restriction that keeps applications from
tapping the full potential of the Web; it prevents the serendipitous
discovery and utilization of relevant data from unknown sources.

An alternative to traditional query execution paradigms are ex- 
ploration approaches that traverse links on the Web of Linked Data. 
These approaches enable a query execution system to automatically 
discover the most recent data from initially unknown data sources. 

The prevalent example of an exploration based approach is link 
traversal based query execution. The idea of this approach is to 
intertwine the traversal of data links with the construction of the 
query result and, thus, to integrate the discovery of data into the 
query execution process [7]. This general idea may be implemented
in various ways. For instance, Ladwig and Tran introduce an
asynchronous implementation that adapts the concept of symmetric hash
joins [13, 14]; Schmedding proposes an implementation that
incrementally adjusts the answer to a query each time the execution
system retrieves additional data [18]; our earlier work focuses on an
implementation that uses a synchronous pipeline of iterators, each
of which is responsible for a particular part of the query [6, 7].
All existing publications focus on approaches for implementing the 
idea of link traversal based query execution and on query optimiza- 
tion in the context of such an implementation. To our knowledge, 
no work exists that provides a general foundation for this new query 
execution paradigm. 

We argue that a well-defined query semantics is essential to com- 
pare different query execution approaches and to verify implemen- 
tations. Furthermore, a proper theoretical foundation enables a for- 
mal analysis of fundamental properties of queries and query exe- 
cutions. For instance, studying the computability of queries may 
answer whether particular query executions are guaranteed to terminate.

1 SELECT ?p ?l WHERE {
2 <http://bob.name> <http://.../knows> ?p .
3 ?p <http://.../currentProject> ?pr .
4 ?pr <http://.../label> ?l . }

Figure 1: Sample query presented in the language SPARQL.

In addition to these more theoretical questions, an
understanding of fundamental properties and limitations may help to gain
new insight into challenges and possibilities for query planning and 
optimization. Therefore, in this paper we provide such a formal 
foundation of Linked Data queries and link traversal based query 
execution. Our contributions are: 

1.) As a basis, we introduce a theoretical framework that comprises
a data model and a computation model. The data model formalizes 
the idea of a Web of Linked Data; the computation model captures 
the limited data access capabilities of computations over the Web. 

2.) We present a query model that introduces a well-defined semantics
for conjunctive queries (which is the type of queries supported
by existing link traversal based systems). Basically, the result of
such a query is the set of all valuations that map the query to a
subset of all Linked Data that is reachable, starting with entity
identifiers mentioned in the query. We emphasize that our model does
not prescribe a specific notion of reachability; instead, the notion
of reachability applied to answer a query can be made explicit (by
specifying which data links should be followed).

3.) We formally analyze properties of our query model. In particular,
we study the implications of querying a potentially infinite
Web and show that it is undecidable whether a query result
will be finite or infinite. Furthermore, we analyze the computabil- 
ity of queries by adopting earlier work on Web queries which dis- 
tinguishes finitely computable queries, eventually computable que- 
ries, and queries that are not even eventually computable. We prove 
that queries in our model are eventually computable. Hence, a link 
traversal based query execution system does not have to deal with 
queries that are not computable at all. However, we also show that 
it is undecidable whether a particular query execution terminates. 

4.) We define an abstract query execution model that formalizes the
general idea of link traversal based query execution. This model 
captures the approach of intertwining link traversal and result con- 
struction. Based on this model we prove the soundness and com- 
pleteness of the new query execution paradigm. 

5.) Finally, we use our execution model to formally analyze a
particular implementation of link traversal based query execution.

This paper is organized as follows: In Section 2 we present an
example that demonstrates the idea of link traversal based query
execution. Section 3 defines our data model and our computation
model. We present our query model in Section 4 and discuss its
properties in Section 5. Section 6 introduces the corresponding
execution model. Finally, we discuss related work in Section 7 and
conclude the paper in Section 8. Appendix B provides all proofs.

2. EXAMPLE EXECUTION 

Link traversal based query execution is a novel query execution 
paradigm tailored to the Web of Linked Data. Since adhering to the 
Linked Data principles is the minimal requirement for publishing 
Linked Data on the WWW, the link traversal approach relies solely 
on these principles; it does not assume that each data source pro- 
vides a data-local query interface (as would be required for query 
federation). The only way to obtain data is via URI look-ups. 

Usually, Linked Data on the WWW is represented using the RDF 
data model [11] and queries are expressed using SPARQL [17].



(http://bob.name, http://.../knows, http://alice.name) $\in G_b$
(http://alice.name, http://.../name, "Alice") $\in G_a$
(http://alice.name, http://.../currentProject, http://.../AlicesPrj) $\in G_a$
(http://.../AlicesPrj, http://.../label, "Alice's Project") $\in G_p$

Figure 2: Excerpts from Linked Data retrieved from the Web. 

SPARQL queries consist of RDF graph patterns that contain query 
variables, denoted with the symbol '?'. The semantics of SPARQL 
is based on pattern matching [16]. Figure 1 provides a SPARQL
representation of a query that asks for projects of acquaintances of 
user Bob, who is identified by URI http://bob.name. In lines 2 to 4 
the query contains a conjunctive query represented as a set of three 
SPARQL triple patterns. In the following we outline a link traversal 
based execution of this conjunctive query. 

Link traversal based query execution usually starts with an empty,
query-local dataset. We obtain some seed data by looking up
the URIs mentioned in the query: For the URI http://bob.name in
our sample query we may retrieve a set $G_b$ of RDF triples (cf.
Figure 2), which we add to the local dataset. Now, we alternate
between i) constructing valuations from RDF triples that match a
pattern of our query in the query-local dataset, and ii) augmenting
the dataset by looking up URIs which are part of these valuations.
For the triple pattern in line 2 of our sample query the local
dataset contains a matching triple, originating from $G_b$. Hence,
we can construct a valuation $\mu_1 = \{?p \to \text{http://alice.name}\}$ that
maps query variable ?p to the URI http://alice.name. By looking
up this URI we may retrieve a set $G_a$ of RDF triples, which
we also add to the query-local dataset. Based on the augmented
dataset we can extend $\mu_1$ by adding a binding for ?pr. We obtain
$\mu_2 = \{?p \to \text{http://alice.name}, ?pr \to \text{http://.../AlicesPrj}\}$, which
already covers the patterns in lines 2 and 3. Notice that constructing $\mu_2$
is only possible because we retrieved $G_a$. However, before we
discovered and resolved the URI http://alice.name, we neither knew
about $G_a$ nor about the existence of the data source from which
we retrieved $G_a$. Hence, the traversal of data links enables us to
answer queries based on data from initially unknown sources.

We proceed with our execution strategy as follows: We discover
and retrieve $G_p$ by looking up the URI http://.../AlicesPrj and
extend $\mu_2$ to
$\mu_3 = \{?p \to \text{http://alice.name}, ?pr \to \text{http://.../AlicesPrj}, ?l \to \text{"Alice's Project"}\}$,
which now covers the whole conjunctive
query. Hence, $\mu_3$ can be reported as the result of that query.
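
The walk-through above can be replayed in a few lines of Python. The following is only a minimal sketch under stated assumptions, not the implementation of any of the cited systems: the dictionary `WEB` stands in for URI look-ups, the names `QUERY`, `extend`, and `execute` are hypothetical, URIs are recognized by an "http" prefix, and the triple patterns are processed in their given order (which reproduces the example but is not guaranteed to find all solutions in general).

```python
# Simulated Web (assumption): URI -> triples of the document it dereferences to.
WEB = {
    "http://bob.name": {("http://bob.name", "http://.../knows", "http://alice.name")},
    "http://alice.name": {
        ("http://alice.name", "http://.../name", "Alice"),
        ("http://alice.name", "http://.../currentProject", "http://.../AlicesPrj"),
    },
    "http://.../AlicesPrj": {("http://.../AlicesPrj", "http://.../label", "Alice's Project")},
}

# The three triple patterns from lines 2 to 4 of the sample query.
QUERY = [
    ("http://bob.name", "http://.../knows", "?p"),
    ("?p", "http://.../currentProject", "?pr"),
    ("?pr", "http://.../label", "?l"),
]


def is_var(term):
    return term.startswith("?")


def extend(mu, pattern, triple):
    """Return mu extended so that pattern maps to triple, or None on conflict."""
    new = dict(mu)
    for p, v in zip(pattern, triple):
        if is_var(p):
            if new.setdefault(p, v) != v:  # variable already bound to another value
                return None
        elif p != v:                       # constant in the pattern does not match
            return None
    return new


def execute(query, web):
    dataset, looked_up = set(), set()      # query-local dataset starts empty

    def look_up(term):
        if term.startswith("http") and term not in looked_up:
            looked_up.add(term)
            dataset.update(web.get(term, set()))

    for pattern in query:                  # seed data: URIs mentioned in the query
        for term in pattern:
            if not is_var(term):
                look_up(term)

    partial = [{}]
    for pattern in query:                  # alternate matching and link traversal
        extended = []
        for mu in partial:
            for triple in list(dataset):   # snapshot: look-ups grow the dataset
                new = extend(mu, pattern, triple)
                if new is not None:
                    for value in new.values():
                        look_up(value)     # traverse links found in the valuation
                    extended.append(new)
        partial = extended
    return partial


result = execute(QUERY, WEB)
```

Under these assumptions `result` holds a single valuation that binds ?p, ?pr, and ?l as in the example; the data of http://alice.name and http://.../AlicesPrj only enters the local dataset because earlier valuations mention these URIs.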

3. MODELING A WEB OF LINKED DATA 

In this section we introduce theoretical foundations which shall al- 
low us to define and to analyze queries over Linked Data. In partic- 
ular, we propose a data model and a computation model. For these 
models we assume a static view of the Web; that is, no changes are 
made to the data on the Web during the execution of a query. 

3.1 Data Model 

The WWW is the most prominent implementation of a Web of 
Linked Data and it shows that the idea of Linked Data scales to 
a virtually unlimited dataspace. Nonetheless, other implementa- 
tions are possible (e.g. within the boundaries of a closed, globally 
distributed corporate network). Such an implementation may be 
based on the same technologies used for the WWW (i.e. HTTP, 
URIs, RDF, etc.) or it may use other, similar technologies. Con- 
sequently, our data model abstracts from the concrete technologies 
that implement Linked Data in the WWW and, thus, enables us to 
study queries over any Web of Linked Data. 

As a basis for our model we use a simple, triple based data model
for representing the data that is distributed over a Web of Linked
Data (similar to the RDF data model that is used for Linked Data
on the WWW). We assume a countably infinite set $\mathcal{I}$ of possible
identifiers (e.g. all URIs) and a countably infinite set $\mathcal{L}$ of all
possible constant literals (e.g. all possible strings, natural numbers, etc.).
$\mathcal{I}$ and $\mathcal{L}$ are disjoint. A data triple is a tuple
$t \in \mathcal{I} \times \mathcal{I} \times (\mathcal{I} \cup \mathcal{L})$.
To denote the set of all identifiers in a data triple $t$ we write $ids(t)$.

We model a Web of Linked Data as a potentially infinite
structure of interlinked documents. Such documents, which we call
Linked Data documents, or LD documents for short, are accessed
via identifiers in $\mathcal{I}$ and contain data that is represented as a set of
data triples. The following definition captures our approach:

Definition 1. A Web of Linked Data $W$ is a tuple $(D, data, adoc)$
where:

• $D$ is a set of symbols that represent LD documents; $D$ may
be finite or countably infinite.

• $data : D \to 2^{\mathcal{I} \times \mathcal{I} \times (\mathcal{I} \cup \mathcal{L})}$ is a total mapping such that
$data(d)$ is finite for all $d \in D$.

• $adoc : \mathcal{I} \to D$ is a partial, surjective mapping.

While the three elements D, data, and adoc completely define a 
Web of Linked Data in our model, we point out that these elements 
are not directly available to a query execution system. However, 
by retrieving LD documents, such a system may gradually obtain 
information about the Web. Based on this information the system 
may (partially) materialize these three elements. In the remainder 
of this section we discuss the three elements and introduce addi- 
tional concepts that we need to define our query model. 
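
For a finite Web, the three elements of Definition 1 can be written down directly. The sketch below is an illustration only; the class name `WebOfLinkedData`, the method `is_well_formed`, and the document symbols `d_b` and `d_a` are assumptions introduced here, and the check covers just the two structural requirements (totality of data, surjectivity of adoc).

```python
from dataclasses import dataclass


@dataclass
class WebOfLinkedData:
    docs: set    # D: symbols that represent LD documents
    data: dict   # data: document -> finite set of data triples (total on D)
    adoc: dict   # adoc: identifier -> document (partial on identifiers, surjective)

    def is_well_formed(self):
        total = set(self.data) == self.docs
        surjective = set(self.adoc.values()) == self.docs
        return total and surjective


# Hypothetical two-document Web built from the triples of Figure 2.
W = WebOfLinkedData(
    docs={"d_b", "d_a"},
    data={
        "d_b": {("http://bob.name", "http://.../knows", "http://alice.name")},
        "d_a": {("http://alice.name", "http://.../name", "Alice")},
    },
    adoc={"http://bob.name": "d_b", "http://alice.name": "d_a"},
)
```

Identifiers such as http://.../knows are simply absent from `adoc`, which mirrors the partiality of the mapping: they cannot be dereferenced in this toy Web.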

We say a Web of Linked Data $W = (D, data, adoc)$ is infinite if
and only if $D$ is infinite; otherwise, we say $W$ is finite. Our model
allows for infinite Webs to cover the possibility that Linked Data 
about an infinite number of identifiable entities is generated on the 
fly. The following example illustrates such a case: 

Example 1. Let $u_i$ denote an HTTP scheme based URI that identifies
the natural number $i$. There is a countably infinite number
of such URIs. The WWW server which is responsible for these
URIs may be set up to provide a document for each natural number.
These documents may be generated upon request and may
contain RDF data including the RDF triple $(u_i, \text{http://.../next}, u_{i+1})$.
This triple associates the natural number $i$ with its successor $i+1$
and, thus, links to the data about $i+1$ [19]. An example for such a
server is provided by the Linked Open Numbers project¹.

Another example where data about an infinite number of entities may
be generated is the LinkedGeoData project² which provides Linked
Data about any circular and rectangular area on Earth [2]. These
examples illustrate that an infinite Web of Linked Data is possible 
in practice. Covering these cases enables us to model queries over 
such data and analyze the effects of executing such queries. 

Even if a Web of Linked Data is infinite, we require countability 
for D. We shall see that this requirement has nontrivial conse- 
quences: It limits the potential size of Webs of Linked Data in our 
model and, thus, allows us to use a Turing machine based model 
for analyzing computability of queries over Linked Data (cf.
Section 5.2). We emphasize that the requirement of countability does
not restrict us in modeling the WWW as a Web of Linked Data: In 
the WWW we use URIs to locate documents that contain Linked 
Data. Even if URIs are not limited in length, they are words over a 
finite alphabet. Thus, the infinite set of all possible URIs is count- 
able, as is the set of all documents that may be retrieved using URIs. 

¹ http://km.aifb.kit.edu/projects/numbers/
² http://linkedgeodata.org



The mapping $data$ associates each LD document $d \in D$ in a
Web of Linked Data $W = (D, data, adoc)$ with a finite set of data
triples. In practice, these triples are obtained by parsing $d$ after $d$
has been retrieved from the Web. The actual retrieval mechanism
depends on the technologies that are used to implement the Web of
Linked Data. To denote the potentially infinite (but countable) set
of all data triples in $W$ we write $\mathrm{AllData}(W)$; i.e. it holds:

$$\mathrm{AllData}(W) = \bigcup_{d \in D} data(d)$$

Since we use elements in the set $\mathcal{I}$ as identifiers for entities, we
say that an LD document $d \in D$ describes the entity identified by
an identifier $id \in \mathcal{I}$ if $\exists (s, p, o) \in data(d) : (s = id \lor o = id)$.
Notice that, while there might be multiple LD documents in $D$ that
describe an entity identified by $id$, we do not assume that we can
enumerate the set of all these documents; i.e., we cannot discover
and retrieve all of them. The possibility to query search engines is
out of scope of this paper. It is part of our future work to extend the
semantics in our query model in order to take into account data that
is reachable by utilizing search engines. However, according to the
Linked Data principles, each $id \in \mathcal{I}$ may also serve as a reference
to a specific LD document which is considered as an authoritative
source of data about the entity identified by $id$. We model the
relationship between identifiers and authoritative LD documents by
the mapping $adoc$. Since some LD documents may be authoritative for
multiple entities, we do not require injectivity for $adoc$. The "real
world" mechanism for dereferencing identifiers (i.e. learning about
the location of the corresponding, authoritative LD document)
depends on the implementation of the Web of Linked Data and is not
relevant for our model. For each identifier $id \in \mathcal{I}$ that cannot be
dereferenced (i.e. "broken links") or that is not used in the Web it
holds $id \notin dom(adoc)$.

An identifier $id \in \mathcal{I}$ with $id \in dom(adoc)$ that is used in the
data of an LD document $d_1 \in D$ constitutes a data link to the LD
document $d_2 = adoc(id) \in D$. To formally represent the graph
structure that is formed by such data links, we introduce the notion 
of a Web link graph. The vertices in such a graph represent the LD 
documents of the corresponding Web of Linked Data; the edges 
represent data links and are labeled with a data triple that denotes 
the corresponding link in the source document. Formally: 

Definition 2. Let $W = (D, data, adoc)$ be a Web of Linked Data.
The Web link graph for $W$, denoted by $G^W$, is a directed,
edge-labeled multigraph $(V, E)$ where $V = D$ and

$$E = \{ (d_h, d_t, t) \mid d_h, d_t \in D \text{ and } t \in data(d_h) \text{ and } \exists\, id \in ids(t) : adoc(id) = d_t \}$$
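
For a finite Web the edge set of Definition 2 can be enumerated directly. The following sketch assumes a representation that is not part of the paper: identifiers are plain Python strings, literals are wrapped in a hypothetical `Literal` type so that `ids` can tell them apart, and `data` and `adoc` are dictionaries.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    """Wrapper that marks a term as a literal rather than an identifier."""
    value: str


def ids(triple):
    """Identifiers that occur in a data triple (literals are excluded)."""
    return {term for term in triple if isinstance(term, str)}


def web_link_graph(docs, data, adoc):
    """Edges (d_h, d_t, t): triple t in d_h mentions an identifier whose
    authoritative document is d_t."""
    return {
        (d_h, adoc[i], t)
        for d_h in docs
        for t in data[d_h]
        for i in ids(t)
        if i in adoc                      # adoc is partial
    }


t1 = ("http://bob.name", "http://.../knows", "http://alice.name")
t2 = ("http://alice.name", "http://.../name", Literal("Alice"))
t3 = ("http://alice.name", "http://.../currentProject", "http://.../AlicesPrj")
t4 = ("http://.../AlicesPrj", "http://.../label", Literal("Alice's Project"))

docs = {"d_b", "d_a", "d_p"}
data = {"d_b": {t1}, "d_a": {t2, t3}, "d_p": {t4}}
adoc = {"http://bob.name": "d_b", "http://alice.name": "d_a",
        "http://.../AlicesPrj": "d_p"}

edges = web_link_graph(docs, data, adoc)
```

Note that the definition does not exclude self-loops: a document that mentions the identifier it is authoritative for links to itself, so `edges` contains, for instance, both `("d_b", "d_b", t1)` and `("d_b", "d_a", t1)`.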

In our query model we introduce the concept of reachable parts 
of a Web of Linked Data that are relevant for answering queries; 
similarly, our execution model introduces a concept for those parts 
of a Web of Linked Data that have been discovered at a certain point 
in the query execution process. To provide a formal foundation for 
these concepts we define the notion of an induced subweb which 
resembles the concept of induced subgraphs in graph theory. 

Definition 3. Let $W = (D, data, adoc)$ be a Web of Linked Data.
A Web of Linked Data $W' = (D', data', adoc')$ is an induced
subweb of $W$ if:

1. $D' \subseteq D$,

2. $\forall d \in D' : data'(d) = data(d)$, and

3. $\forall id \in \{ id \in \mathcal{I} \mid adoc(id) \in D' \} : adoc'(id) = adoc(id)$.



It can be easily seen from Definition 3 that specifying $D'$ is
sufficient to define an induced subweb $(D', data', adoc')$ of a given
Web of Linked Data unambiguously. Furthermore, it is easy to
verify that for an induced subweb $W'$ of a Web of Linked Data $W$ it
holds $\mathrm{AllData}(W') \subseteq \mathrm{AllData}(W)$.
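
The observation that $D'$ alone determines the induced subweb translates into a short construction. This is a sketch for finite Webs with dictionaries standing in for the mappings; the function names `induced_subweb` and `all_data` and the sample documents are assumptions made for illustration.

```python
def induced_subweb(docs, data, adoc, sub_docs):
    """Build (D', data', adoc') from D' by restricting data and adoc."""
    sub_docs = set(sub_docs)
    if not sub_docs <= set(docs):
        raise ValueError("D' must be a subset of D")
    sub_data = {d: data[d] for d in sub_docs}
    # adoc' keeps exactly the identifiers whose authoritative document is in D'.
    sub_adoc = {i: d for i, d in adoc.items() if d in sub_docs}
    return sub_docs, sub_data, sub_adoc


def all_data(docs, data):
    """Union of the data of all documents in docs."""
    return set().union(*(data[d] for d in docs))


docs = {"d_b", "d_a"}
data = {
    "d_b": {("http://bob.name", "http://.../knows", "http://alice.name")},
    "d_a": {("http://alice.name", "http://.../name", "Alice")},
}
adoc = {"http://bob.name": "d_b", "http://alice.name": "d_a"}

sub_docs, sub_data, sub_adoc = induced_subweb(docs, data, adoc, {"d_a"})
```

In the subweb, http://bob.name is no longer in the domain of the restricted mapping, so the triple that mentions it elsewhere would be a dangling link there.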

3.2 Computation Model 

Usually, functions are computed over structures that are assumed 
to be fully (and directly) accessible. A Web of Linked Data, in 
contrast, is a structure in which accessibility is limited: To discover 
LD documents and access their data we have to dereference identi- 
fiers, but the full set of those identifiers for which we may retrieve 
documents is unknown. Hence, to properly analyze queries over a 
Web of Linked Data we require a model for computing functions 
on such a Web. This section introduces such a model. 

In earlier work about computation on the WWW, Abiteboul and 
Vianu introduce a specific Turing machine called Web machine [1].
Mendelzon and Milo propose a similar machine model [15]. These
machines formally capture the limited data access capabilities on 
the WWW and thus present an adequate abstraction for computa- 
tions over a structure such as the WWW. We adopt the idea of such 
a Web machine to our scenario of a Web of Linked Data. We call 
our machine a Linked Data machine (or LD machine, for short). 

Encoding (fragments of) a Web of Linked Data $W = (D, data, adoc)$
on the tapes of such a machine is straightforward because all
relevant structures, such as the sets $D$ or $\mathcal{I}$, are countably infinite.
In the remainder of this paper we write $enc(x)$ to denote the
encoding of some element $x$ (e.g. a single data triple, a set of triples,
a full Web of Linked Data, etc.). For a detailed definition of the
encodings we use in this paper, we refer to Appendix A.

We now define our adaptation of the idea of Web machines: 

Definition 4. An LD machine is a multi-tape Turing machine
with five tapes and a finite set of states, including a special state
called expand. The five tapes include two read-only input tapes:
i) an ordinary input tape and ii) a right-infinite Web tape which
can only be accessed in the expand state; two work tapes: iii) an
ordinary, two-way infinite work tape and iv) a right-infinite link
traversal tape; and v) a right-infinite, append-only output tape.
Initially, the work tapes and the output tape are empty, the Web
tape contains a (potentially infinite) word that encodes a Web of
Linked Data, and the ordinary input tape contains an encoding
of further input (if any). Any LD machine operates like an
ordinary multi-tape Turing machine except when it reaches the
expand state. In this case LD machines perform the following
expand procedure: The machine inspects the word currently stored
on the link traversal tape. If the suffix of this word is the
encoding $enc(id)$ of some identifier $id \in \mathcal{I}$ and the word on the
Web tape contains $\sharp\, enc(id)\, enc(adoc(id))\, \sharp$, then the machine
appends $enc(adoc(id))\, \sharp$ to the (right) end of the word on the link
traversal tape by copying from the Web tape; otherwise, the
machine appends $\sharp$ to the word on the link traversal tape.

Notice how an LD machine is limited in the way it may access a 
Web of Linked Data that is encoded on its Web (input) tape: Any 
LD document and its data is only available for the computation after 
the machine performed the expand procedure using a correspond- 
ing identifier. Hence, the expand procedure models a URI based 
lookup which is the (typical) data access method on the WWW. 
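
The effect of the expand procedure on the link traversal tape can be mimicked with ordinary lists. The sketch below is a loose simulation under simplifying assumptions, not a Turing machine: the Web tape is a finite list of (enc(id), enc(adoc(id))) pairs, tape contents are opaque tokens, and the string `SEP` stands in for the delimiter symbol; all of these names are hypothetical.

```python
SEP = "#"  # stand-in for the delimiter symbol used on the tapes


def expand(web_tape, link_tape):
    """Perform one expand step; link_tape is modified in place.

    Returns True if the identifier at the end of the link traversal
    tape could be dereferenced, False otherwise.
    """
    suffix = link_tape[-1] if link_tape else None
    for enc_id, enc_doc in web_tape:          # sequential scan of the Web tape
        if enc_id == suffix:
            link_tape.extend([enc_doc, SEP])  # copy enc(adoc(id)) from the Web tape
            return True
    link_tape.append(SEP)                     # no document for this identifier
    return False


web_tape = [("enc(bob)", "enc(d_b)"), ("enc(alice)", "enc(d_a)")]

tape_ok = ["enc(alice)"]
found = expand(web_tape, tape_ok)

tape_broken = ["enc(carol)"]
missing = expand(web_tape, tape_broken)
```

The rest of the machine never reads `web_tape` directly; it only sees what `expand` copied, which is the restriction the definition is meant to capture. For an infinite Web tape the scan for an unknown identifier would not finish in this sketch, whereas the definition treats expand as a single primitive step.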

In the following sections we use the notion of an LD machine for 
analyzing properties of our query model. In this context we aim to 
discuss decision problems that shall have a Web of Linked Data W 
as input. For these problems we assume that the computation may 
only be performed by an LD machine with $enc(W)$ on its Web tape:



Definition 5. Let $\mathcal{W}$ be a (potentially infinite) set of Webs of
Linked Data; let $\mathcal{X}$ be an arbitrary (potentially infinite) set of finite
structures; and let $DP \subseteq \mathcal{W} \times \mathcal{X}$. The decision problem for $DP$,
that is, to decide for any $(W, X) \in \mathcal{W} \times \mathcal{X}$ whether $(W, X) \in DP$,
is LD machine decidable if there exists an LD machine whose
computation on any $W \in \mathcal{W}$ encoded on the Web tape and any $X \in \mathcal{X}$
encoded on the ordinary input tape has the following property: The
machine halts in an accepting state if $(W, X) \in DP$; otherwise the
machine halts in a rejecting state.

Obviously, any (Turing) decidable problem that does not have a 
Web of Linked Data as input, is also LD machine decidable be- 
cause LD machines are Turing machines; for these problems the 
corresponding set $\mathcal{W}$ is empty.

4. QUERY MODEL 

This section introduces our query model by defining semantics for 
conjunctive queries over Linked Data. 

4.1 Preliminaries 

We assume an infinite set $\mathcal{V}$ of possible query variables that is
disjoint from the sets $\mathcal{I}$ and $\mathcal{L}$ introduced in the previous section.
These variables will be used to range over elements in $\mathcal{I} \cup \mathcal{L}$. Thus,
valuations in our context are total mappings from a finite subset of
$\mathcal{V}$ to the set $\mathcal{I} \cup \mathcal{L}$. We denote the domain of a particular valuation $\mu$
by $dom(\mu)$. Using valuations we define our general understanding
of queries over a Web of Linked Data as follows:

Definition 6. Let $\mathcal{W}$ be the set of all possible Webs of Linked Data
(i.e. all 3-tuples that correspond to Definition 1) and let $\Omega$ be the
set of all possible valuations. A Linked Data query $q$ is a total
function $q : \mathcal{W} \to 2^{\Omega}$.

To express conjunctive Linked Data queries we adapt the notion of 
a SPARQL basic graph pattern [17] to our data model:

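Definition 7. A basic query pattern (BQP) is a finite set
$B = \{tp_1, \ldots, tp_n\}$ of tuples
$tp_i \in (\mathcal{V} \cup \mathcal{I}) \times (\mathcal{V} \cup \mathcal{I}) \times (\mathcal{V} \cup \mathcal{I} \cup \mathcal{L})$
(for $1 \le i \le n$). We call such a tuple a triple pattern.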

In comparison to traditional notions of conjunctive queries, triple 
patterns are the counterpart of atomic formulas; furthermore, BQPs 
have no head, hence no bound variables. To denote the set of 
variables and identifiers that occur in a triple pattern tp we write 
$vars(tp)$ and $ids(tp)$, respectively. Accordingly, the set of
variables and identifiers that occur in all triple patterns of a BQP $B$ is
denoted by $vars(B)$ and $ids(B)$, respectively. For a triple pattern
$tp$ and a valuation $\mu$ we write $\mu[tp]$ to denote the triple pattern that
we obtain by replacing the variables in $tp$ according to $\mu$. Similarly,
a valuation $\mu$ is applied to a BQP $B$ by $\mu[B] = \{\mu[tp] \mid tp \in B\}$.
The result of $\mu[tp]$ is a data triple if $vars(tp) \subseteq dom(\mu)$.
Accordingly, we introduce the notion of matching data triples:

Definition 8. A data triple $t$ matches a triple pattern $tp$ if there
exists a valuation $\mu$ such that $\mu[tp] = t$.
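
Applying a valuation and testing for a match are both mechanical. In the sketch below, variables are assumed to be strings that start with "?", valuations are dictionaries, and the function names `apply_valuation` and `match` are hypothetical; `match` returns a witnessing valuation, or None if no valuation maps the pattern to the triple.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def apply_valuation(mu, tp):
    """Compute mu[tp]: replace every variable bound by mu, keep the rest."""
    return tuple(mu.get(term, term) if is_var(term) else term for term in tp)


def match(t, tp):
    """Return a valuation mu with mu[tp] == t, or None if t does not match tp."""
    mu = {}
    for p, v in zip(tp, t):
        if is_var(p):
            if mu.setdefault(p, v) != v:  # a repeated variable needs equal values
                return None
        elif p != v:                      # identifiers and literals must coincide
            return None
    return mu


t = ("http://bob.name", "http://.../knows", "http://alice.name")
tp = ("http://bob.name", "http://.../knows", "?p")
mu = match(t, tp)
```

The repeated-variable check matters: a pattern such as `("?x", "http://.../knows", "?x")` is matched only by triples whose subject and object are identical.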

While BQPs are syntactic objects, we shall use them as a represen- 
tation of Linked Data queries which have a certain semantics. In the 
remainder of this section we define this semantics. Due to the open- 
ness and distributed nature of Webs such as the WWW we cannot 
guarantee query results that are complete w.r.t. all Linked Data on 
a Web. Nonetheless, we aim to provide a well-defined semantics. 
Consequently, we have to limit our understanding of completeness. 
However, instead of restricting ourselves to data from a fixed set 
of sources selected or discovered beforehand, we introduce an
approach that allows a query to make use of previously unknown data
and sources. Our definition of query semantics is based on a
two-phase approach: First, we define the part of a Web of Linked Data
that is reached by traversing links using the identifiers in a query 
as a starting point. Then, we formalize the result of such a query 
as the set of all valuations that map the query to a subset of all 
data in the reachable part of the Web. Notice, while this two-phase 
approach provides for a straightforward definition of the query se- 
mantics in our model, it does not correspond to the actual query 
execution strategy of integrating the traversal of data links into the 
query execution process as illustrated in Section 2.

4.2 Reachability 

To introduce the concept of a reachable part of a Web of Linked 
Data we first define reachability of LD documents. Informally, an 
LD document is reachable if there exists a (specific) path in the Web 
link graph of a Web of Linked Data to the document in question; 
the potential starting points for such a path are LD documents that 
are authoritative for entities mentioned (via their identifier) in the 
queries. However, allowing for arbitrary paths might be question- 
able in practice because it would require following all data links 
(recursively) for answering a query completely. A more restrictive 
approach is the notion of query pattern based reachability where 
a data link only qualifies as a part of paths to reachable LD doc- 
uments, if that link corresponds to a triple pattern in the executed 
query. The link traversal based query execution illustrated in
Section 2 applies this notion of query pattern based reachability (as we
show in Section 6.3). Our experience in developing a link traversal
based query execution system³ suggests that query pattern based
reachability is a good compromise for answering queries without 
crawling large portions of the Web that are likely to be irrelevant 
for the queries. However, other criteria for specifying which data 
links should be followed might prove to be more suitable in certain 
use cases. For this reason, we do not prescribe a specific criterion 
in our query model; instead, we enable our model to support any 
possible criterion by making this concept part of the model. 

Definition 9. Let $\mathcal{T}$ be the infinite set of all possible data triples; let
$\mathcal{B}$ be the infinite set of all possible BQPs. A reachability criterion
$c$ is a total computable function
$c : \mathcal{T} \times \mathcal{I} \times \mathcal{B} \to \{true, false\}$.

An example for such a reachability criterion is $c_{All}$ which
corresponds to the approach of allowing for arbitrary paths to reach LD
documents; hence, for each tuple $(t, id, B) \in \mathcal{T} \times \mathcal{I} \times \mathcal{B}$ it holds
$c_{All}(t, id, B) = true$. The complement of $c_{All}$ is $c_{None}$ which
always returns false. Another example is $c_{Match}$ which corresponds
to the aforementioned query pattern based reachability. We define
$c_{Match}$ based on the notion of matching data triples:



$$c_{Match}(t, id, B) = \begin{cases} true & \text{if } \exists\, tp \in B : t \text{ matches } tp, \\ false & \text{else.} \end{cases} \qquad (1)$$



We call a reachability criterion $c_1$ less restrictive than another
criterion $c_2$ if i) for each tuple $(t, id, B) \in \mathcal{T} \times \mathcal{I} \times \mathcal{B}$ for which
$c_2(t, id, B) = true$, also holds $c_1(t, id, B) = true$ and ii) there
exists a $(t', id', B') \in \mathcal{T} \times \mathcal{I} \times \mathcal{B}$ such that $c_1(t', id', B') = true$
but $c_2(t', id', B') = false$. It can be seen that $c_{All}$ is the least
restrictive criterion, whereas $c_{None}$ is the most restrictive criterion.
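
The three example criteria are small enough to state as functions. The sketch assumes the same conventions as elsewhere in our illustrations (variables are strings starting with "?", a BQP is a list of 3-tuples); the function names `c_all`, `c_none`, `c_match`, and `matches` are hypothetical stand-ins for the criteria named in the text.

```python
def matches(t, tp):
    """True if some valuation maps triple pattern tp to data triple t."""
    mu = {}
    for p, v in zip(tp, t):
        if isinstance(p, str) and p.startswith("?"):
            if mu.setdefault(p, v) != v:
                return False
        elif p != v:
            return False
    return True


def c_all(t, id_, B):
    return True                      # follow every data link


def c_none(t, id_, B):
    return False                     # follow no data link at all


def c_match(t, id_, B):
    return any(matches(t, tp) for tp in B)   # equation (1)


B = [
    ("http://bob.name", "http://.../knows", "?p"),
    ("?p", "http://.../currentProject", "?pr"),
    ("?pr", "http://.../label", "?l"),
]
knows = ("http://bob.name", "http://.../knows", "http://alice.name")
name = ("http://alice.name", "http://.../name", "Alice")
```

With the sample query as `B`, the knows triple qualifies its identifiers for traversal under `c_match`, while the name triple does not because no pattern in the query uses that predicate. The identifier argument is ignored by all three functions; the signature only mirrors Definition 9.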

Using the concept of reachability criteria for data links we for- 
mally define reachability of LD documents: 

Definition 10. Let $W = (D, data, adoc)$ be a Web of Linked
Data; let $S \subset \mathcal{I}$ be a finite set of seed identifiers; let $c$ be a
reachability criterion; and let $B$ be a BQP. An LD document $d \in D$ is
$(c, B)$-reachable from $S$ in $W$ if either

1. there exists an $id \in S$ such that $adoc(id) = d$; or

2. there exists another LD document $d' \in D$, a $t \in data(d')$,
and an $id \in ids(t)$ such that i) $d'$ is $(c, B)$-reachable from $S$
in $W$, ii) $c(t, id, B) = true$, and iii) $adoc(id) = d$.

³ http://squin.org

We note that each LD document which is authoritative for an entity 
mentioned (via its identifier) in a finite set of seed identifiers $S$ is
always reachable from S in the corresponding Web of Linked Data, 
independent of the reachability criterion and the BQP used. 

Based on reachability of LD documents we now define reachable 
parts of a Web of Linked Data. Informally, such a part is an induced 
subweb covering all reachable LD documents. Formally: 

Definition 11. Let $W = (D, data, adoc)$ be a Web of Linked
Data; let $S \subset \mathcal{I}$ be a finite set of seed identifiers; let $c$ be a
reachability criterion; and let $B$ be a BQP. The $(S, c, B)$-reachable part
of $W$ is the induced subweb $W_c^{(S,B)} = (D_{\mathfrak{R}}, data_{\mathfrak{R}}, adoc_{\mathfrak{R}})$ of $W$
that is defined by

$$D_{\mathfrak{R}} = \{ d \in D \mid d \text{ is } (c, B)\text{-reachable from } S \text{ in } W \}$$
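
Definitions 10 and 11 can be read as a graph search. The sketch below is a hedged illustration under our own assumptions, not part of the formal model: a Web is given as two Python dictionaries (`adoc` maps identifiers to document names, `data` maps document names to sets of triples), every term of a triple is treated as an identifier, and the Web is finite so that the search terminates. The toy Web with documents `d1` to `d6` is hypothetical.

```python
from collections import deque


def reachable_documents(adoc, data, seeds, criterion, bqp):
    """Return the documents that are (c, B)-reachable from the seed identifiers."""
    reached = set()
    # Rule 1 of Definition 10: documents that are authoritative for a seed.
    queue = deque(adoc[i] for i in seeds if i in adoc)
    while queue:
        d = queue.popleft()
        if d in reached:
            continue
        reached.add(d)
        # Rule 2: follow each data link that the criterion accepts.
        for t in data[d]:
            for ident in t:
                if ident in adoc and criterion(t, ident, bqp):
                    queue.append(adoc[ident])
    return reached


# Hypothetical finite toy Web: d_k describes the number k (successor, divisors).
adoc = {f"no{k}": f"d{k}" for k in range(1, 7)}
data = {
    f"d{k}": {(f"no{k}", "succ", f"no{k + 1}")}
    | {(f"no{k}", "div", f"no{y}") for y in range(1, k + 1) if k % y == 0}
    for k in range(1, 7)
}

follow_all = lambda t, ident, bqp: True
follow_none = lambda t, ident, bqp: False
print(sorted(reachable_documents(adoc, data, {"no2"}, follow_all, frozenset())))
print(sorted(reachable_documents(adoc, data, {"no2"}, follow_none, frozenset())))
```

With the seed `no2`, the criterion that accepts every link reaches all six documents, whereas the criterion that rejects every link reaches only `d2`, the document obtained directly from the seed.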

4.3 Query Results 

Based on the previous definitions we define the semantics of con- 
junctive Linked Data queries that are expressed via BQPs. Recall 
that Linked Data queries map from a Web of Linked Data to a set 
of valuations. Our interpretation of BQPs as Linked Data queries 
requires that each valuation $\mu$ in the result for a particular BQP $B$
satisfies the following requirement: If we replace the variables in
$B$ according to $\mu$ (i.e. we compute $\mu[B]$), we obtain a set of data
triples and this set must be a subset of all data in the part of the 
Web that is reachable according to the notion of reachability that 
we apply. Since our model supports a virtually unlimited number 
of notions of reachability, each of which is defined by a particular 
reachability criterion, the actual result of a query must depend on 
such a reachability criterion. The following definition formalizes 
our understanding of conjunctive Linked Data queries: 



Definition 12. Let $S \subset \mathcal{I}$ be a finite set of seed identifiers; let $c$
be a reachability criterion; let $B$ be a BQP; let $W$ be a Web of
Linked Data; and let $W_c^{(S,B)}$ denote the $(S, c, B)$-reachable part of $W$.
The conjunctive Linked Data query (CLD query) that uses $B$,
$S$, and $c$, denoted by $Q_c^{B,S}$, is a Linked Data query defined as:

$$Q_c^{B,S}(W) := \{ \mu \mid \mu \text{ is a valuation with } dom(\mu) = vars(B) \text{ and } \mu[B] \subseteq AllData(W_c^{(S,B)}) \}$$

Each $\mu \in Q_c^{B,S}(W)$ is a solution for $Q_c^{B,S}$ in $W$.

Since we define the result of queries w.r.t. a reachability criterion,
the semantics of such queries depends on this criterion. Thus, 
strictly speaking, our query model introduces a family of query se- 
mantics, each of which is characterized by a reachability criterion. 
Therefore, we refer to a CLD query for which we use a particular 
reachability criterion c as a CLD query under c-semantics. 
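
The second phase of this semantics, computing all valuations over the data of an already determined reachable part, can be sketched as follows. This is an illustrative assumption-laden sketch: the reachable part is represented only by its set of data triples (a finite Python set), variables are strings that start with `?`, and the nested-loop evaluation strategy is ours and not prescribed by Definition 12.

```python
def extend(mu, tp, t):
    """Return mu extended such that mu[tp] == t, or None if that is impossible."""
    new_mu = dict(mu)
    for term, value in zip(tp, t):
        if term.startswith("?"):
            if new_mu.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new_mu


def solutions(bqp, triples):
    """All valuations mu with dom(mu) = vars(bqp) and mu[bqp] a subset of triples."""
    results = [{}]
    for tp in bqp:
        results = [
            new_mu
            for mu in results
            for new_mu in (extend(mu, tp, t) for t in triples)
            if new_mu is not None
        ]
    return results


# Data of documents d2 and d3 from the number example (assumed finite input).
all_data = {
    ("no2", "succ", "no3"), ("no2", "div", "no1"), ("no2", "div", "no2"),
    ("no3", "succ", "no4"), ("no3", "div", "no1"), ("no3", "div", "no3"),
}
print(solutions([("no2", "succ", "?x")], all_data))
```

Since the result depends only on the triples handed to `solutions`, different reachability criteria change the result exactly by changing this input set.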

5. PROPERTIES OF THE QUERY MODEL 

In this section we discuss properties of our query model. In particu- 
lar, we focus on the implications of querying Webs that are infinite 
and on the (LD machine based) computability of queries. 

5.1 Querying an Infinite Web of Linked Data 

From Definitions 10 and 11 in Section 4 it can be easily seen that
any reachable part of a finite Web of Linked Data must also be
finite, independent of the query that we want to answer and the
reachability criterion that we use. Consequently, the result of CLD
queries over such a finite Web is also guaranteed to be finite. We
shall see that a similarly general statement does not exist when the
queried Web is infinite such as the WWW.

To study the implications of querying an infinite Web we first
take a look at some example queries. For these examples we assume
an infinite Web of Linked Data $W_{inf} = (D_{inf}, data_{inf}, adoc_{inf})$
that contains LD documents for all natural numbers (similar to the
documents in Example 1). The data in these documents refers to
the successor of the corresponding number and to all its divisors.
Hence, for each natural number² $k \in \mathbb{N}^+$, identified by $no_k \in \mathcal{I}$,
there exists an LD document $adoc_{inf}(no_k) = d_k \in D_{inf}$ such that

$$data_{inf}(d_k) = \{ (no_k, succ, no_{k+1}) \} \cup \bigcup_{y \in Div(k)} \{ (no_k, div, no_y) \}$$

where $Div(k)$ denotes the set of all divisors of $k \in \mathbb{N}^+$, $succ \in \mathcal{I}$
identifies the successor relation for $\mathbb{N}^+$, and $div \in \mathcal{I}$ identifies the
relation that associates a number $k \in \mathbb{N}^+$ with a divisor $y \in Div(k)$.

Example 2. Let $B_1 = \{ (no_2, succ, ?x) \}$ be a BQP ($?x \in \mathcal{V}$)
that asks for the successor of 2. Recall, $data_{inf}(d_2)$ contains three
data triples: $(no_2, succ, no_3)$, $(no_2, div, no_1)$, and $(no_2, div, no_2)$.
We consider reachability criteria $c_{All}$, $c_{Match}$, and $c_{None}$ (cf.
Section 4.2) and $S_1 = \{ no_2 \}$.³ The $(S_1, c_{All}, B_1)$-reachable part of
$W_{inf}$ is infinite and consists of the LD documents $d_1, \ldots, d_k, \ldots$.
In contrast, the $(S_1, c_{Match}, B_1)$-reachable part $W_{c_{Match}}^{(S_1,B_1)}$ and the
$(S_1, c_{None}, B_1)$-reachable part $W_{c_{None}}^{(S_1,B_1)}$ are finite: $W_{c_{Match}}^{(S_1,B_1)}$
consists of $d_2$ and $d_3$, whereas $W_{c_{None}}^{(S_1,B_1)}$ only consists of $d_2$. The
query result in all three cases contains a single solution $\mu$ for which
$dom(\mu) = \{ ?x \}$ and $\mu(?x) = no_3$, i.e. $\mu = \{ ?x \to no_3 \}$.

Example 3. We now consider the BQP $B_2 = \{ (no_2, succ, ?x),
(?x, succ, ?y), (?z, div, ?x) \}$ with $?x, ?y, ?z \in \mathcal{V}$ and $S_2 = \{ no_2 \}$.
Under $c_{None}$-semantics the query result is empty because the
$(S_2, c_{None}, B_2)$-reachable part of $W_{inf}$ only consists of LD document
$d_2$ (as in the previous example). For $c_{All}$ and $c_{Match}$ the reachable
parts are infinite (and equal): Both consist of the documents
$d_1, \ldots, d_k, \ldots$ (as was the case for $c_{All}$ but not for $c_{Match}$ in the
previous example). While the query result is also equal for both criteria,
it differs significantly from the previous example because it is infinite:
$Q_{c_{Match}}^{B_2,S_2}(W_{inf}) = Q_{c_{All}}^{B_2,S_2}(W_{inf}) = \{ \mu_1, \mu_2, \ldots, \mu_i, \ldots \}$ where

$\mu_1 = \{ ?x \to no_3, ?y \to no_4, ?z \to no_3 \}$,
$\mu_2 = \{ ?x \to no_3, ?y \to no_4, ?z \to no_6 \}$,
and, in general: $\mu_i = \{ ?x \to no_3, ?y \to no_4, ?z \to no_{(3i)} \}$.
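
The growth of this result can be observed on finite prefixes of the example Web. The following Python sketch is an illustration under our own assumptions: it materializes only the documents $d_1, \ldots, d_{limit}$, encodes $no_k$ as the string `"no<k>"`, and hard-codes the three triple patterns of $B_2$ as nested loops.

```python
def data_inf(k):
    """Data of document d_k: the successor of k and all divisors of k."""
    succ = {(f"no{k}", "succ", f"no{k + 1}")}
    divs = {(f"no{k}", "div", f"no{y}") for y in range(1, k + 1) if k % y == 0}
    return succ | divs


def example3_solutions(limit):
    """Solutions of B2 that are visible in the documents d_1 .. d_limit."""
    triples = set().union(*(data_inf(k) for k in range(1, limit + 1)))
    found = []
    for s1, p1, x in triples:                 # (no2, succ, ?x)
        if (s1, p1) != ("no2", "succ"):
            continue
        for s2, p2, y in triples:             # (?x, succ, ?y)
            if (s2, p2) != (x, "succ"):
                continue
            for z, p3, o3 in triples:         # (?z, div, ?x)
                if p3 == "div" and o3 == x:
                    found.append({"?x": x, "?y": y, "?z": z})
    return sorted(found, key=lambda mu: int(mu["?z"][2:]))


print([mu["?z"] for mu in example3_solutions(10)])
print(len(example3_solutions(30)))
```

Every multiple of 3 that falls into the materialized prefix contributes one further solution, so the number of solutions grows without bound as `limit` increases.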

A special type of CLD queries not covered by the examples are 
queries that use an empty set of seed identifiers. However, it is 
easily verified that answering such queries is trivial: 

Fact 1. Let $W$ be a Web of Linked Data. For each CLD query
$Q_c^{B,S}$ for which $S = \emptyset$, it holds: The set of LD documents in the
$(S, c, B)$-reachable part of $W$ is empty and, thus, $Q_c^{B,S}(W) = \emptyset$.

Due to its triviality, an empty set of seed identifiers presents a
special case that we exclude from most of our results. We now
summarize the conclusions that we draw from Examples 2 and 3:

Proposition 1. Let $S \subset \mathcal{I}$ be a finite but nonempty set of seed
identifiers; let $c$ and $c'$ be reachability criteria; let $B$ be a BQP;
and let $W$ be an infinite Web of Linked Data. It holds:

1. $W_{c_{None}}^{(S,B)}$ is always finite; so is $Q_{c_{None}}^{B,S}(W)$.

2. If $W_c^{(S,B)}$ is finite, then $Q_c^{B,S}(W)$ is finite.

3. If $Q_c^{B,S}(W)$ is infinite, then $W_c^{(S,B)}$ is infinite.

4. If $c$ is less restrictive than $c'$ and $W_c^{(S,B)}$ is finite, then
$W_{c'}^{(S,B)}$ is finite.

5. If $c'$ is less restrictive than $c$ and $W_c^{(S,B)}$ is infinite, then
$W_{c'}^{(S,B)}$ is infinite.

6. If $c'$ is less restrictive than $c$, then $Q_c^{B,S}(W) \subseteq Q_{c'}^{B,S}(W)$.

² In this paper we write $\mathbb{N}^+$ to denote the set of all natural numbers
without zero. $\mathbb{N}^0$ denotes all natural numbers, including zero.
³ We assume $succ \notin dom(adoc_{inf})$ and $div \notin dom(adoc_{inf})$.

Proposition 1 provides valuable insight into the dependencies between
reachability criteria, the (in)finiteness of reachable parts of
an infinite Web, and the (in)finiteness of query results. In practice, 
however, we are primarily interested in the following questions: 
Does the execution of a given CLD query reach an infinite number 
of LD documents? Do we have to expect an infinite query result? 
We formalize these questions as (LD machine) decision problems: 

Problem:       FinitenessReachablePart
Web Input:     a (potentially infinite) Web of Linked Data $W$
Ordin. Input:  a CLD query $Q_c^{B,S}$ where $S$ is nonempty and $c$
               is less restrictive than $c_{None}$
Question:      Is the $(S, c, B)$-reachable part of $W$ finite?

Problem:       FinitenessQueryResult
Web Input:     a (potentially infinite) Web of Linked Data $W$
Ordin. Input:  a CLD query $Q_c^{B,S}$ where $S$ is nonempty and $c$
               is less restrictive than $c_{None}$
Question:      Is the query result $Q_c^{B,S}(W)$ finite?

Unfortunately, it is impossible to define a general algorithm for an- 
swering these problems as our following result shows. 

Theorem 1. The problems FinitenessReachablePart and
FinitenessQueryResult are not LD machine decidable.

5.2 Computability of Linked Data Queries 

Example 3 illustrates that some CLD queries may have a result that
is infinitely large. Even if a query has a finite result it may still 
be necessary to retrieve infinitely many LD documents to ensure 
that the computed result is complete. Hence, any attempt to answer 
such queries completely induces a non-terminating computation. 

In what follows, we formally analyze feasibility and limitations 
for computing CLD queries. For this analysis we adopt notions 
of computability that Abiteboul and Vianu introduce in the context 
of queries over a hypertext-centric view of the WWW [1]. These
notions are: finitely computable queries, which correspond to the 
traditional notion of computability; and eventually computable que- 
ries whose computation may not terminate but each element of the 
query result will eventually be reported during the computation. 
While Abiteboul and Vianu define these notions of computability 
using their concept of a Web machine (cf. Section [T2] >. our adapta- 
tion for Linked Data queries uses an LD machine: 

Definition 13. A Linked Data query $q$ is finitely computable if
there exists an LD machine which, for any Web of Linked Data $W$
encoded on the Web tape, halts after a finite number of steps and
produces a possible encoding of $q(W)$ on its output tape.

Definition 14. A Linked Data query $q$ is eventually computable
if there exists an LD machine whose computation on any Web of
Linked Data $W$ encoded on the Web tape has the following two
properties: 1.) the word on the output tape at each step of the
computation is a prefix of a possible encoding of $q(W)$ and 2.) the
encoding $enc(\mu')$ of any $\mu' \in q(W)$ becomes part of the word on
the output tape after a finite number of computation steps.

We now analyze the computability of CLD queries. As a prelimi- 
nary we identify a dependency between the computation of a CLD 
query over a particular Web of Linked Data and the (in)finiteness 
of the corresponding reachable part of that Web: 

Lemma 1. The result of a CLD query $Q_c^{B,S}$ over a (potentially
infinite) Web of Linked Data W can be computed by an LD machine 
that halts after a finite number of computation steps if and only if 
the (S, c, B)-reachable part of W is finite. 

The following immediate consequence of Lemma 1 is trivial.

Corollary 1. CLD queries that use an empty set of seed identifiers
and CLD queries under $c_{None}$-semantics are finitely computable.

While Corollary 1 covers some special cases, the following result
identifies the computability of CLD queries in the general case. 

Theorem 2. Each CLD query is either finitely computable or even- 
tually computable. 

Theorem 2 emphasizes that execution systems for CLD queries do
not have to deal with queries that are not even eventually computable.
Theorem 2 also shows that query computations in the general
case are not guaranteed to terminate. The reason for this re-
sult is the potential infiniteness of Webs of Linked Data. However, 
even if a CLD query is only eventually computable, its computa- 
tion over a particular Web of Linked Data may still terminate (even 
if this Web is infinite). Thus, in practice, we are interested in cri- 
teria that allow us to decide whether a particular query execution is 
guaranteed to terminate. We formalize this decision problem: 

Problem:       ComputabilityCLD
Web Input:     a (potentially infinite) Web of Linked Data $W$
Ordin. Input:  a CLD query $Q_c^{B,S}$ where $S$ is nonempty and $c$
               is less restrictive than $c_{None}$
Question:      Does an LD machine exist that i) computes
               $Q_c^{B,S}(W)$ and ii) halts?

Unfortunately: 

Theorem 3. ComputabilityCLD is not LD machine decidable. 

As a consequence of the results in this section we note that any 
system which executes CLD queries over an infinite Web of Linked 
Data (such as the WWW) must be prepared for query executions 
that do not terminate and that discover an infinite amount of data. 

6. QUERY EXECUTION MODEL 

In Section 4 we use a two-phase approach to define (a family of)
semantics for conjunctive queries over Linked Data. A query ex- 
ecution system that would directly implement this two-phase ap- 
proach would have to retrieve all LD documents before it could 
generate the result for a query. Hence, the first solutions could only 
be generated after all data links (that qualify according to the used 
reachability criterion) have been followed recursively. Retrieving 
the complete set of reachable documents may exceed the resources 
of the execution system or it may take a prohibitively long time; it 
is even possible that this process does not terminate at all (cf.
Section 5.2). The link traversal based query execution that we
demonstrate in Section 2 applies an alternative strategy: It intertwines
the link traversal based retrieval of data with a pattern matching 



process that generates solutions incrementally. Due to such an in- 
tegration of link traversal and result construction it is possible to 
report first solutions early, even if not all links have been followed 
and not all data has been retrieved. To describe link traversal based 
query execution formally, we introduce an abstract query execution 
model. In this section we present this model and use it for proving 
soundness and completeness of the modeled approach. 

6.1 Preliminaries 

Usually, queries are executed over a finite structure of data (e.g. an 
instance of a relational schema or an RDF dataset) that is assumed 
to be fully available to the execution system. However, in this paper 
we are concerned with queries over a Web of Linked Data that may 
be infinite and that is fully unknown at the beginning of a query 
execution process. To learn about such a Web we have to deref- 
erence identifiers and parse documents that we retrieve. Concep- 
tually, dereferencing an identifier corresponds to achieving partial 
knowledge of the set D and mapping adoc with which we model 
the queried Web of Linked Data $W = (D, data, adoc)$. Similarly,
parsing documents retrieved from the Web corresponds to learning 
mapping data. To formally represent what we know about a Web 
of Linked Data at any particular point in a query execution process 
we introduce the concept of discovered parts. 

Definition 15. A discovered part of a Web of Linked Data $W$ is
an induced subweb of $W$ that is finite.

We require finiteness for discovered parts of a Web of Linked Data
$W$. This requirement models the fact that we obtain information
about $W$ only gradually; thus, at any point in a query execution
process we only know a finite part of $W$, even if $W$ is infinite.

The (link traversal based) execution of a CLD query $Q_c^{B,S}$ over a
Web of Linked Data $W = (D, data, adoc)$ starts with a discovered
part $\mathfrak{D}_{init}^{S,W}$ (of $W$) which contains only those LD documents from
$W$ that can be retrieved by dereferencing identifiers from $S$; hence,
$\mathfrak{D}_{init}^{S,W} = (D_0, data_0, adoc_0)$ is defined by:

$$D_0 = \{ adoc(id) \mid id \in S \text{ and } id \in dom(adoc) \} \qquad (2)$$

In the remainder of this section we first define how we may use 
data from a discovered part to construct (partial) solutions for a 
CLD query in an incremental fashion. Furthermore, we formalize 
how the link traversal approach expands such a discovered part in 
order to construct further solutions. Finally, we discuss an abstract 
procedure that formally captures how the approach intertwines the 
expansion of discovered parts with the construction of solutions. 

6.2 Constructing Solutions 

The query execution approach that we aim to capture with our 
query execution model constructs solutions for a query incrementally
(cf. Section 2). To formalize the intermediate products of such
a construction we introduce the concept of partial solutions.

Definition 16. A partial solution for CLD query $Q_c^{B,S}$ in a Web of
Linked Data $W$ is a pair $(P, \mu)$ where $P \subseteq B$ and $\mu \in Q_c^{P,S}(W)$.

According to Definition 16 each partial solution $(P, \mu)$ for a CLD
query $Q_c^{B,S}$ is a solution for the CLD query $Q_c^{P,S}$ that uses BQP $P$
(instead of $B$). Since $P$ is a part of $B$ we say that partial solutions
cover only a part of the queries that we want to answer.

The (link traversal based) execution of a CLD query $Q_c^{B,S}$ over
a Web of Linked Data $W$ starts with an empty partial solution
$\sigma_0 = (P_0, \mu_0)$ which covers the empty part $P_0 = \emptyset$ of $B$ (i.e.
$dom(\mu_0) = \emptyset$). During query execution we (incrementally) extend
partial solutions to cover larger parts of $B$. Those partial solutions
that cover the whole query can be reported as solutions for
$Q_c^{B,S}$ in $W$. However, to extend a partial solution we may use data
only from LD documents that we have already discovered. Consequently,
the following definition formalizes the extension of a partial
solution based on a discovered part of a Web of Linked Data.

Definition 17. Let $W_{\mathfrak{D}}$ be a discovered part of a Web of Linked
Data $W$; let $Q_c^{B,S}$ be a CLD query; and let $\sigma = (P, \mu)$ be a partial
solution for $Q_c^{B,S}$ in $W$. If there exist a triple pattern $tp \in B \setminus P$
and a data triple $t \in AllData(W_{\mathfrak{D}})$ such that $t$ matches $tp$ then
the $(t, tp)$-augmentation of $\sigma$ in $W_{\mathfrak{D}}$, denoted by $aug_{t,tp}^{W_{\mathfrak{D}}}(\sigma)$, is a
pair $(P', \mu')$ such that $P' = P \cup \{ tp \}$ and $\mu'$ extends $\mu$ as follows:
1.) $dom(\mu') = vars(P')$ and 2.) $\mu'[P'] = \mu[P] \cup \{ t \}$.

The following proposition shows that the result of augmenting a 
partial solution is again a partial solution, as long as the discov- 
ered part of the Web that we use for such an augmentation is fully 
contained in the reachable part of the Web. 

Proposition 2. Let $W_{\mathfrak{D}}$ be a discovered part of a Web of Linked
Data $W$ and let $Q_c^{B,S}$ be a CLD query. If $W_{\mathfrak{D}}$ is an induced subweb
of the $(S, c, B)$-reachable part of $W$ and $\sigma$ is a partial solution for
$Q_c^{B,S}$ in $W$, then $aug_{t,tp}^{W_{\mathfrak{D}}}(\sigma)$ is also a partial solution for $Q_c^{B,S}$ in
$W$, for all possible $t$ and $tp$.

6.3 Traversing Data Links 

During query execution we may traverse data links to expand the 
discovered part. Such an expansion may allow us to compute fur- 
ther augmentations for partial solutions. The link traversal based 
approach implements such an expansion by dereferencing identifiers
that occur in valuations $\mu$ of partial solutions (cf. Section 2).
Formally, we define such a valuation based expansion as follows: 

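Definition 18. Let $W_{\mathfrak{D}} = (D_{\mathfrak{D}}, data_{\mathfrak{D}}, adoc_{\mathfrak{D}})$ be a discovered
part of a Web of Linked Data $W = (D, data, adoc)$ and
let $\mu$ be a valuation. The $\mu$-expansion of $W_{\mathfrak{D}}$ in $W$, denoted by
$exp_{\mu}^{W}(W_{\mathfrak{D}})$, is an induced subweb $(D'_{\mathfrak{D}}, data'_{\mathfrak{D}}, adoc'_{\mathfrak{D}})$ of $W$,
defined by $D'_{\mathfrak{D}} = D_{\mathfrak{D}} \cup \Delta^{W}(\mu)$ where

$$\Delta^{W}(\mu) = \{ adoc(\mu(?v)) \mid ?v \in dom(\mu) \text{ and } \mu(?v) \in dom(adoc) \}$$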

The following propositions show that expanding discovered parts is
a monotonic operation (Proposition 3) and that the set of all possible
discovered parts is closed under this operation (Proposition 4).

Proposition 3. Let $W_{\mathfrak{D}}$ be a discovered part of a Web of Linked
Data $W$, then $W_{\mathfrak{D}}$ is an induced subweb of $exp_{\mu}^{W}(W_{\mathfrak{D}})$, for all
possible $\mu$.

Proposition 4. Let $W_{\mathfrak{D}}$ be a discovered part of a Web of Linked
Data $W$, then $exp_{\mu}^{W}(W_{\mathfrak{D}})$ is also a discovered part of $W$, for all
possible $\mu$.

We motivate the expansion of discovered parts of a queried Web 
of Linked Data by the possibility that data obtained from addition- 
ally discovered documents may allow us to construct more (partial) 
solutions. However, Proposition 2 indicates that the augmentation
of partial solutions is only sound if the discovered part that we use 
for the augmentation is fully contained in the corresponding reach- 
able part of the Web. Thus, in order to use a discovered part that 
has been expanded based on (previously constructed) partial solu- 
tions, it should be guaranteed that the expansion never exceeds the 
reachable part. Under $c_{Match}$-semantics we have such a guarantee:

Proposition 5. Let $\sigma = (P, \mu)$ be a partial solution for a CLD
query $Q_{c_{Match}}^{B,S}$ (under $c_{Match}$-semantics) in a Web of Linked Data
$W$; and let $W_{c_{Match}}^{(S,B)}$ denote the $(S, c_{Match}, B)$-reachable part of $W$.
If a discovered part $W_{\mathfrak{D}}$ of $W$ is an induced subweb of $W_{c_{Match}}^{(S,B)}$
then $exp_{\mu}^{W}(W_{\mathfrak{D}})$ is also an induced subweb of $W_{c_{Match}}^{(S,B)}$.

We explain the restriction to $c_{Match}$-semantics in Proposition 5 as
follows: During link traversal based query execution we expand
the discovered part of the queried Web only by using valuations
that occur in partial solutions (cf. Section 2). Due to this approach,
we only dereference identifiers for which there exists a data triple
that matches a triple pattern in our query. Hence, this approach
indirectly enforces query pattern based reachability (cf. Section 4.2).
As a result, link traversal based query execution only supports CLD
queries under $c_{Match}$-semantics; so does our query execution model.

6.4 Combining Construction and Traversal 

Although incrementally expanding the discovered part of the reach- 
able subweb and recursively augmenting partial solutions may be 
understood as separate processes, the idea of link traversal based 
query execution is to combine these two processes. We now intro- 
duce an abstract procedure which captures this idea formally. 

As a basis for our formalization we represent the state of a query
execution by a pair $(\mathfrak{P}, \mathfrak{D})$; $\mathfrak{P}$ denotes the (finite) set of partial
solutions that have already been constructed at the current point in
the execution process; $\mathfrak{D}$ denotes the currently discovered part of
the queried Web of Linked Data. As discussed before, we initialize
$\mathfrak{P}$ with the empty partial solution $\sigma_0$ (cf. Section 6.2) and $\mathfrak{D}$ with
$\mathfrak{D}_{init}^{S,W}$ (cf. Section 6.1). During the query execution process $\mathfrak{P}$ and
$\mathfrak{D}$ grow monotonically: We augment partial solutions from $\mathfrak{P}$ and
add the results back to $\mathfrak{P}$; additionally, we use partial solutions
from $\mathfrak{P}$ to expand $\mathfrak{D}$. However, conceptually we combine these
two types of tasks, augmentation and expansion, into a single type:

Definition 19. Let $Q_{c_{Match}}^{B,S}$ be a CLD query (under $c_{Match}$-semantics);
let $(\mathfrak{P}, \mathfrak{D})$ represent a state of a (link traversal based)
execution of $Q_{c_{Match}}^{B,S}$. An AE task for $(\mathfrak{P}, \mathfrak{D})$ is a tuple $(\sigma, t, tp)$
for which it holds i) $\sigma = (P, \mu) \in \mathfrak{P}$, ii) $t \in AllData(\mathfrak{D})$,
iii) $tp \in B \setminus P$, and iv) $t$ matches $tp$.

Performing an AE task $(\sigma, t, tp)$ for $(\mathfrak{P}, \mathfrak{D})$ comprises two steps:
1.) changing $\mathfrak{P}$ to $\mathfrak{P} \cup \{ (P', \mu') \}$, where $(P', \mu') = aug_{t,tp}^{\mathfrak{D}}(\sigma)$ is
the $(t, tp)$-augmentation of $\sigma$ in $\mathfrak{D}$, and 2.) expanding $\mathfrak{D}$ to the
$\mu'$-expansion of $\mathfrak{D}$ in $W$. Notice, constructing the augmentation in the
first step is always possible because the prerequisites for AE tasks,
as given in Definition 19, correspond to the prerequisites for
augmentations (cf. Definition 17). However, not all possible AE tasks
may actually change $\mathfrak{P}$ and $\mathfrak{D}$; instead, some tasks $(\sigma, t, tp)$ may
produce an augmentation $aug_{t,tp}^{\mathfrak{D}}(\sigma)$ that turns out to be a partial
solution which has already been produced for another task. Thus,
to guarantee progress during a query execution process we must
only perform those AE tasks that produce new augmentations. To
identify such tasks we introduce the concept of open AE tasks.

Definition 20. An AE task $(\sigma, t, tp)$ for the state $(\mathfrak{P}, \mathfrak{D})$ of a link
traversal based query execution is open if $aug_{t,tp}^{\mathfrak{D}}(\sigma) \notin \mathfrak{P}$. To
denote the set of all open AE tasks for $(\mathfrak{P}, \mathfrak{D})$ we write $Open(\mathfrak{P}, \mathfrak{D})$.

We now use the introduced concepts to present our abstract procedure
ltbExec (cf. Algorithm 1) with which we formalize the general
idea of link traversal based query execution. After initializing
$\mathfrak{P}$ and $\mathfrak{D}$ (lines 1 and 2 in Algorithm 1), the procedure amounts
to a continuous execution of open AE tasks. We represent this
continuous process by a loop (lines 3 to 9); each iteration of this
loop performs an open AE task (lines 5 to 7) and checks whether
the newly constructed partial solution $(P', \mu')$ covers the executed
CLD query as a whole, in which case the valuation $\mu'$ in $(P', \mu')$
must be reported as a solution for the query (line 8). We emphasize
that the set $Open(\mathfrak{P}, \mathfrak{D})$ of all open AE tasks always changes
when ltbExec performs such a task. The loop terminates when
no more open AE tasks for (the current) $(\mathfrak{P}, \mathfrak{D})$ exist (which may
never be the case as we know from Lemma 1).

Algorithm 1  ltbExec(S, B, W): Report all $\mu \in Q_{c_{Match}}^{B,S}(W)$.

1: $\mathfrak{P} := \{ \sigma_0 \}$
2: $\mathfrak{D} := \mathfrak{D}_{init}^{S,W}$
3: while $Open(\mathfrak{P}, \mathfrak{D}) \neq \emptyset$ do
4:    Choose open AE task $(\sigma, t, tp) \in Open(\mathfrak{P}, \mathfrak{D})$
5:    $(P', \mu') := aug_{t,tp}^{\mathfrak{D}}(\sigma)$
6:    $\mathfrak{P} := \mathfrak{P} \cup \{ (P', \mu') \}$   // indirectly changes $Open(\mathfrak{P}, \mathfrak{D})$
7:    $\mathfrak{D} := exp_{\mu'}^{W}(\mathfrak{D})$
8:    if $P' = B$ then report $\mu'$ endif
9: end while
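
For readers who prefer executable notation, the following Python sketch mirrors the structure of ltbExec on a finite toy Web. It is a minimal sketch under stated assumptions, not an implementation from the literature: the Web is given as two dictionaries (`adoc`, `data`) that are hypothetical stand-ins for dereferencing identifiers and parsing documents, variables are strings that start with `?`, a partial solution is a pair of frozensets, and `augment` returns `None` when a triple is incompatible with the bindings already present in the valuation. On an infinite Web the loop would not terminate.

```python
def augment(sigma, t, tp):
    """(t, tp)-augmentation of sigma = (P, mu), or None if t does not fit mu."""
    P, mu = sigma
    new_mu = dict(mu)
    for term, value in zip(tp, t):
        if term.startswith("?"):
            if new_mu.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return (P | {tp}, frozenset(new_mu.items()))


def open_augmentations(partial, triples, bqp):
    """Yield the augmentations produced by open AE tasks of the current state."""
    for sigma in partial:
        for tp in bqp - sigma[0]:
            for t in triples:
                new = augment(sigma, t, tp)
                if new is not None and new not in partial:
                    yield new


def ltb_exec(seeds, bqp, adoc, data):
    bqp = frozenset(bqp)
    partial = {(frozenset(), frozenset())}            # line 1: the empty partial solution
    docs = {adoc[i] for i in seeds if i in adoc}      # line 2: initial discovered part
    reported = []
    while True:
        triples = set().union(*(data[d] for d in docs))
        new = next(open_augmentations(partial, triples, bqp), None)   # lines 3 to 5
        if new is None:                               # no open AE task is left
            return reported
        partial.add(new)                              # line 6
        P_new, mu_new = new
        docs |= {adoc[v] for _, v in mu_new if v in adoc}             # line 7
        if P_new == bqp:                              # line 8
            reported.append(dict(mu_new))


# Hypothetical finite toy Web with documents d1..d9 (successor and divisors).
adoc = {f"no{k}": f"d{k}" for k in range(1, 10)}
data = {
    f"d{k}": {(f"no{k}", "succ", f"no{k + 1}")}
    | {(f"no{k}", "div", f"no{y}") for y in range(1, k + 1) if k % y == 0}
    for k in range(1, 10)
}
B2 = [("no2", "succ", "?x"), ("?x", "succ", "?y"), ("?z", "div", "?x")]
print(sorted(mu["?z"] for mu in ltb_exec({"no2"}, B2, adoc, data)))
```

The sketch always picks the first open task it finds; any other choice strategy yields the same set of reported valuations on a finite Web, only in a different order.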

We emphasize the abstract nature of Algorithm 1. The fact that
we model ltbExec as a single loop which performs (open) AE tasks
sequentially does not imply that the link traversal based query
execution paradigm has to be implemented in such a form. Instead,
different implementation approaches are possible, some of which
have already been proposed in the literature [6, 7, 13, 14]. In
contrast to the concrete (implementable) algorithms discussed in this
earlier work, we understand Algorithm 1 as an instrument for
presenting and for studying the general idea that is common to all link
traversal based query execution approaches.

6.5 Application of the Model 

Based on our query execution model we now show that the idea of
link traversal based query execution is sound and complete, that is,
the set of all valuations reported by ltbExec(S, B, W) is equivalent
to the query result $Q_{c_{Match}}^{B,S}(W)$. Formally:

Theorem 4. Let $W$ be a Web of Linked Data and let $Q_{c_{Match}}^{B,S}$ be a
CLD query (under $c_{Match}$-semantics).

• Soundness: For any valuation $\mu$ reported by an execution of
ltbExec(S, B, W) it holds that $\mu \in Q_{c_{Match}}^{B,S}(W)$.

• Completeness: Any $\mu \in Q_{c_{Match}}^{B,S}(W)$ will eventually be
reported by any execution of ltbExec(S, B, W).

Theorem 4 formally verifies the applicability of link traversal based
query execution for answering conjunctive queries over a Web of 
Linked Data. For experimental evaluations that demonstrate the 
feasibility of link traversal based execution of queries over Linked 
Data on the WWW we refer to (HIT] [131 El- We note, however, 
that the implementation approaches used for these evaluations do 
not allow for an explicit specification of seed identifiers S. Instead, 
these approaches use the identifiers in the BQP of a query as seed 
and, thus, only support CLD queries $Q_c^{B,S}$ for which $S = ids(B)$.
Theorem 4 highlights that this is a limitation of these particular
implementation approaches and not a general property of link
traversal based query execution.

In the remainder of this section we use our (abstract) execution 
model to analyze the iterator based implementation of link traversal 
based query execution that we introduce in [6, 7]. The analysis
of this implementation approach is particularly interesting because 
this approach trades completeness of query results for the guarantee 
that all query executions terminate as we shall see. 



The implementation approach applies a synchronized pipeline
of operators that evaluate the BQP $B = \{ tp_1, \ldots, tp_n \}$ of a CLD
query in a fixed order. This pipeline is implemented as a chain of
iterators $I_1, \ldots, I_n$; iterator $I_k$ is responsible for triple pattern $tp_k$
(for all $1 \leq k \leq n$) from the ordered BQP. While the selection of
an order for the BQP is an optimization problem [8], we assume
a given order for the following analysis (in fact, the order is
irrelevant for the analysis). Each iterator $I_k$ provides valuations that
are solutions for CLD query $Q_{c_{Match}}^{P_k,S}$ where $P_k = \{ tp_1, \ldots, tp_k \}$.
To determine these solutions each iterator $I_k$ executes the following
four steps repetitively: First, $I_k$ consumes a valuation $\mu'$ from
its direct predecessor and applies this valuation to its triple pattern
$tp_k$, resulting in a triple pattern $tp'_k = \mu'[tp_k]$; second, $I_k$ (tries to)
generate solutions by finding matching triples for $tp'_k$ in the
query-local dataset; third, $I_k$ uses the generated solutions to expand the
query-local dataset; and, fourth, $I_k$ (iteratively) reports each of the
generated solutions. For a more detailed description of this
implementation approach we refer to [6].
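
The four steps can be sketched with Python generators. This is a simplified illustration under our own assumptions and not the implementation described in the cited work: the Web is given as the dictionaries `adoc` and `data` of a hypothetical finite toy Web, the query-local dataset is a mutable set of triples shared by all iterators, seed identifiers are passed explicitly, and the first iterator consumes a single empty valuation.

```python
def bind(tp, t, mu):
    """Return mu extended such that mu[tp] == t, or None if that is impossible."""
    new_mu = dict(mu)
    for term, value in zip(tp, t):
        if term.startswith("?"):
            if new_mu.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new_mu


def iterator(tp, predecessor, dataset, adoc, data):
    for mu in predecessor:                                   # step 1: consume a valuation
        candidates = (bind(tp, t, mu) for t in dataset)
        found = [m for m in candidates if m is not None]     # step 2: match in the dataset
        for m in found:                                      # step 3: expand the dataset
            for value in m.values():
                if value in adoc:
                    dataset |= data[adoc[value]]
        yield from found                                     # step 4: report solutions


def iterator_exec(seeds, patterns, adoc, data):
    dataset = set()
    for ident in seeds:
        if ident in adoc:
            dataset |= data[adoc[ident]]
    pipeline = iter([{}])
    for tp in patterns:                                      # chain I_1, ..., I_n
        pipeline = iterator(tp, pipeline, dataset, adoc, data)
    return list(pipeline)


# Hypothetical finite toy Web with documents d1..d9 (successor and divisors).
adoc = {f"no{k}": f"d{k}" for k in range(1, 10)}
data = {
    f"d{k}": {(f"no{k}", "succ", f"no{k + 1}")}
    | {(f"no{k}", "div", f"no{y}") for y in range(1, k + 1) if k % y == 0}
    for k in range(1, 10)
}
B2 = [("no2", "succ", "?x"), ("?x", "succ", "?y"), ("?z", "div", "?x")]
print(iterator_exec({"no2"}, B2, adoc, data))
```

In this toy Web the pipeline reports only the solution with `?z` bound to `no3`, although the documents `d6` and `d9` contain triples that would yield further solutions; those documents are never added to the query-local dataset because each iterator matches against the dataset only once per consumed valuation.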

In terms of our abstract execution model, each iterator performs
a particular subset of all possible open AE tasks: For each open
AE task $(\sigma, t, tp)$ performed by iterator $I_k$ it holds i) $tp = tp_k$ and
ii) $\sigma = (P_{k-1}, \mu)$ where $P_{k-1} = \{ tp_1, \ldots, tp_{k-1} \}$. However, $I_k$
may not perform all (open) AE tasks which have these properties.

Lemma 2. During an iterator execution of an arbitrary CLD query
$Q_{c_{Match}}^{B,S}$ (that uses $c_{Match}$) over an arbitrary Web of Linked Data $W$
it holds: The set of AE tasks performed by each iterator is finite.

Based on Lemma 2 we easily see that an iterator execution of a
CLD query $Q_{c_{Match}}^{B,S}$ may not perform all possible (open) AE tasks.
Thus, we may show the following result as a corollary of Lemma 2.

Theorem 5. Any iterator based execution of a CLD query $Q_{c_{Match}}^{B,S}$
(that uses $c_{Match}$) over an arbitrary Web of Linked Data $W$ reports
a finite subset of $Q_{c_{Match}}^{B,S}(W)$ and terminates.

Theorem 5 shows that the analyzed implementation of link traversal
based query execution trades completeness of query results for the
guarantee that all query executions terminate. The degree to which
the reported subset of a query result is complete depends on the
order selected for the BQP of the executed query as our experiments
in [6] show. A formal analysis of this dependency is part of our
future work.

7. RELATED WORK 

Since its emergence the World Wide Web has spawned research to 
adapt declarative query languages for retrieval of information from 
the WWW ID. Most of these works understand the WWW as a 
graph of objects interconnected by hypertext links; in some models
objects have certain attributes (e.g. title, modification date) [15]
or an internal structure [5, 12]. Query languages studied in this
context allow a user to either ask for specific objects [12], for their
attributes [15], or for specific object content [5]. However, there is
no explicit connection between data that may be obtained from
different objects (in contrast to the more recent idea of Linked Data).
Nonetheless, some of the foundational work such as [1] and [15]
can be adapted to query execution over a Web of Linked Data. In
this paper we analyze the computability of CLD queries by adopting
Abiteboul and Vianu's notions of computability [1], for which
we have to adapt their machine model of computation on the Web.

In addition to the early work on Web queries, query execution 
over Linked Data on the WWW has attracted much attention
recently. In [9] we provide an overview of different approaches and
refer to the relevant literature. However, the only work we are 



aware of that formally captures the concept of Linked Data and
provides a well-defined semantics for queries in this context is Bouquet
et al. [3]. In contrast to our more abstract, technology-independent
data model, their focus is Linked Data on the WWW, implemented
using concrete technologies such as URIs and RDF. They adopt the
common understanding of a set of RDF triples as graphs [11].
Consequently, Bouquet et al. model a Web of Linked Data as a "graph
space", that is, a set of RDF graphs, each of which is associated 
with a URI that, when dereferenced on the WWW, allows a system 
to obtain that graph. Hence, RDF graphs in Bouquet et al.'s graph 
space correspond to the LD documents in our data model; the URIs 
associated with RDF graphs in a graph space have a role similar to 
that of those identifiers in our data model for which the correspond- 
ing mapping adoc returns an actual LD document (i.e. all identifiers 
in $dom(adoc)$). Therefore, RDF graphs in a graph space form
another type of (higher level) graph, similar to the Web link graph in
our model (although Bouquet et al. do not define that graph
explicitly). Based on their data model, Bouquet et al. define three types of
query methods for conjunctive queries: a bounded method which 
only uses those RDF graphs that are referred to in queries, a navi- 
gational method which corresponds to our query model, and a di- 
rect access method which assumes an oracle that provides all RDF 
graphs which are "relevant" for a given query. For the navigational 
method the authors define a notion of reachability that allows a 
query execution system to follow all data links. Hence, the semantics
of queries using this navigational method is equivalent to CLD
queries under $c_{All}$-semantics in our query model. Bouquet et al.'s
navigational query model does not support other, more restrictive 
notions of reachability, as is possible with our model. Furthermore, 
Bouquet et al. do not discuss the computability of queries and the 
infiniteness of the WWW. 

8. CONCLUSIONS AND FURTHER WORK 

Link traversal based query execution is a novel query execution ap- 
proach tailored to the Web of Linked Data. The ability to discover 
data from unknown sources is its most distinguishing advantage 
over traditional query execution paradigms which assume a fixed 
set of potentially relevant data sources beforehand. In this paper 
we provide a formal foundation for this new approach. 

We introduce a family of well-defined semantics for conjunctive
Linked Data queries, taking into account the limited data access
capabilities that are typical for the WWW. We show that the execution
of such queries may not terminate (cf. Theorem 2) because, due to
the existence of data generating servers, the WWW is infinite (at
any point in time). Moreover, queries may have a result that is
infinitely large. We show that it is impossible to provide an algorithm
for deciding whether any given query (in our model) has a finite
result (cf. Theorem 1). Furthermore, it is also impossible to decide
(in general) whether a query execution terminates (cf. Theorem 3),
even if the expected result were known to be finite.

In addition to our query model we introduce an execution model
that formally captures the link traversal based query execution
paradigm. This model abstracts from any particular approach to
implement this paradigm. Based on this model we prove that the general
idea of link traversal based query execution is sound and complete
for conjunctive Linked Data queries (cf. Theorem 4).

Our future work focuses on more expressive types of Linked 
Data queries. In particular, we aim to study which other features 
of query languages such as SPARQL are feasible in the context of 
querying a Web of Linked Data and what the implications of sup- 
porting such features are. Moreover, we will extend our models to 
capture the dynamic nature of the Web and, thus, to study the impli- 
cations of changes in data sources during the execution of a query. 
APPENDIX

The Appendix is organized as follows: 

• Appendix A describes how we encode relevant structures (such as a Web of Linked Data and a valuation) on the tapes of Turing
machines.

• Appendix B contains the full technical proofs for all results in this paper.

A. ENCODING 

To encode Webs of Linked Data and query results on the tapes of a Turing machine we assume the existence of total orders $\prec_\mathcal{I}$, $\prec_\mathcal{C}$, and
$\prec_\mathcal{V}$ for the identifiers in $\mathcal{I}$, the constants in $\mathcal{C}$, and the variables in $\mathcal{V}$, respectively; in all three cases $\prec_x$ could simply be the lexicographic
order of corresponding string representations. Furthermore, we assume a total order $\prec_t$ for data triples that is based on the aforementioned
orders.

For each $id \in \mathcal{I}$, $c \in \mathcal{C}$, and $v \in \mathcal{V}$ let $enc(id)$, $enc(c)$, and $enc(v)$ be the binary representation of $id$, $c$, and $v$, respectively. The encoding
of a data triple $t = (s, p, o)$, denoted by $enc(t)$, is a word $\langle enc(s), enc(p), enc(o) \rangle$.

The encoding of a finite set of data triples $T = \{t_1, \ldots, t_n\}$, denoted by $enc(T)$, is a word $\langle\langle enc(t_1), enc(t_2), \ldots, enc(t_n) \rangle\rangle$ where the
$enc(t_i)$ are ordered as follows: For each two data triples $t_x, t_y \in T$, $enc(t_x)$ occurs before $enc(t_y)$ in $enc(T)$ if $t_x \prec_t t_y$.

For a Web of Linked Data $W = (D, data, adoc)$, the encoding of LD document $d \in D$, denoted by $enc(d)$, is the word $enc(data(d))$.
The encoding of $W$ itself, denoted by $enc(W)$, is a word

$\sharp\, enc(id_1)\, enc(adoc(id_1))\, \sharp \ldots \sharp\, enc(id_i)\, enc(adoc(id_i))\, \sharp \ldots$

where $id_1, id_2, \ldots$ is the (potentially infinite but countable) list of identifiers in $dom(adoc)$, ordered according to $\prec_\mathcal{I}$.

The encoding of a valuation $\mu$ with $dom(\mu) = \{v_1, \ldots, v_n\}$, denoted by $enc(\mu)$, is a word

$\langle\langle enc(v_1) \rightarrow enc(\mu(v_1)), \ldots, enc(v_n) \rightarrow enc(\mu(v_n)) \rangle\rangle$

where the $enc(\mu(v_i))$ are ordered as follows: For each two variables $v_x, v_y \in dom(\mu)$, $enc(\mu(v_x))$ occurs before $enc(\mu(v_y))$ in $enc(\mu)$ if
$v_x \prec_\mathcal{V} v_y$.

Finally, the encoding of a (potentially infinite) set of valuations $\Omega = \{\mu_1, \mu_2, \ldots\}$, denoted by $enc(\Omega)$, is a word $enc(\mu_1)\, enc(\mu_2) \ldots$
where the $enc(\mu_i)$ may occur in any order.
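
The encoding scheme above can be illustrated by a minimal Python sketch. It is not part of our formal model: the UTF-8 bit strings, the ASCII delimiters `<`, `>`, `<<`, `>>`, and `->`, and the use of Python's built-in string and tuple ordering as the assumed total orders are all choices made for illustration only.

```python
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]


def enc_term(term: str) -> str:
    """Binary representation of an identifier, constant, or variable.

    Assumption: we simply use the bits of the UTF-8 string representation.
    """
    return "".join(f"{byte:08b}" for byte in term.encode("utf-8"))


def enc_triple(t: Triple) -> str:
    """enc(t) for a data triple t = (s, p, o)."""
    s, p, o = t
    return f"<{enc_term(s)},{enc_term(p)},{enc_term(o)}>"


def enc_triples(triples: Iterable[Triple]) -> str:
    """enc(T): member encodings appear in the order given by the triple order."""
    # Python's tuple comparison stands in for the assumed total order on triples.
    ordered = sorted(set(triples))
    return "<<" + ",".join(enc_triple(t) for t in ordered) + ">>"


def enc_valuation(mu: Dict[str, str]) -> str:
    """enc(mu): bindings appear in the order of their variables."""
    pairs = (f"{enc_term(v)}->{enc_term(mu[v])}" for v in sorted(mu))
    return "<<" + ",".join(pairs) + ">>"
```

Because the members are sorted before they are concatenated, two equal sets of triples (or two equal valuations) always yield the same word, which is the property the fixed orders are meant to provide.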

B. PROOFS 

B.1 Additional References for the Proofs

[Pap93] C. H. Papadimitriou. Computational Complexity. Addison Wesley, 1993. 

B.2 Proof of Proposition 1

Let: 

• $S \subset \mathcal{I}$ be a finite but nonempty set of seed identifiers;

• $c$ and $c'$ be reachability criteria;

• $B$ be a BQP such that $\mathcal{Q}_c^{B,S}$ and $\mathcal{Q}_{c'}^{B,S}$ are CLD queries;

• $W = (D, data, adoc)$ be an infinite Web of Linked Data.

1. $W_{c_{\mathsf{None}}}^{(S,B)}$ is always finite; so is $\mathcal{Q}_{c_{\mathsf{None}}}^{B,S}(W)$.

Let $D_\mathfrak{R}$ denote the set of all LD documents in $W_{c_{\mathsf{None}}}^{(S,B)}$. Since $c_{\mathsf{None}}$ always returns false it is easily verified that there is no LD document
$d \in D$ that satisfies case 2 in Definition 10. Hence, it must hold $D_\mathfrak{R} = \{adoc(id) \mid id \in S \text{ and } id \in dom(adoc)\}$ (cf. case 1 in
Definition 10). Since $S$ is finite we see that $D_\mathfrak{R}$ is guaranteed to be finite (and so is $W_{c_{\mathsf{None}}}^{(S,B)}$). The finiteness of $\mathcal{Q}_{c_{\mathsf{None}}}^{B,S}(W)$ can then be shown
based on Proposition 1, case 2.

2. If $W_c^{(S,B)}$ is finite, then $\mathcal{Q}_c^{B,S}(W)$ is finite.

If $W_c^{(S,B)}$ is finite, there exist only a finite number of different possible subsets of $\mathrm{AllData}(W_c^{(S,B)})$. Hence, there can only be a finite
number of different valuations $\mu$ with $\mu[B] \subseteq \mathrm{AllData}(W_c^{(S,B)})$.

3. If $\mathcal{Q}_c^{B,S}(W)$ is infinite, then $W_c^{(S,B)}$ is infinite.

If $\mathcal{Q}_c^{B,S}(W)$ is infinite, we have infinitely many valuations $\mu \in \mathcal{Q}_c^{B,S}(W)$. For each of them there exists a unique subset $\mu[B] \subseteq \mathrm{AllData}(W_c^{(S,B)})$
(cf. Definition 12). Hence, there are infinitely many such subsets. Thus, $W_c^{(S,B)}$ must be infinite.

4. If $c$ is less restrictive than $c'$ and $W_c^{(S,B)}$ is finite, then $W_{c'}^{(S,B)}$ is finite.

If $W_c^{(S,B)}$ is finite, then there exist finitely many LD documents $d \in D$ that are $(c, B)$-reachable from $S$ in $W$. A subset of them is also
$(c', B)$-reachable from $S$ in $W$ because $c$ is less restrictive than $c'$. Hence, $W_{c'}^{(S,B)}$ must also be finite.

5. If $c'$ is less restrictive than $c$ and $W_c^{(S,B)}$ is infinite, then $W_{c'}^{(S,B)}$ is infinite.

If $W_c^{(S,B)}$ is infinite, then there exist infinitely many LD documents $d \in D$ that are $(c, B)$-reachable from $S$ in $W$. Each of them is also
$(c', B)$-reachable from $S$ in $W$ because $c'$ is less restrictive than $c$. Hence, $W_{c'}^{(S,B)}$ must also be infinite.

6. If $c'$ is less restrictive than $c$, then $\mathcal{Q}_c^{B,S}(W) \subseteq \mathcal{Q}_{c'}^{B,S}(W)$.

If $c'$ is less restrictive than $c$, then each LD document $d \in D$ that is $(c, B)$-reachable from $S$ in $W$ is also $(c', B)$-reachable from $S$ in $W$.
Hence, $\mathrm{AllData}(W_c^{(S,B)}) \subseteq \mathrm{AllData}(W_{c'}^{(S,B)})$ and, thus, $\mathcal{Q}_c^{B,S}(W) \subseteq \mathcal{Q}_{c'}^{B,S}(W)$.
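
The containment arguments in cases 4 to 6 can be observed on a small finite example. The following Python sketch is an illustration under simplifying assumptions that are not part of our model: a Web is a finite dictionary from identifiers to the data of their LD documents, a document is identified by its identifier, variables are strings that start with `?`, and the names `reachable`, `c_all`, `c_match`, and `c_none` are hypothetical.

```python
from collections import deque
from typing import Callable, Dict, Set, Tuple

Triple = Tuple[str, str, str]
Web = Dict[str, Set[Triple]]  # identifier -> data of the document adoc(identifier)
Criterion = Callable[[Triple, str, Set[Triple]], bool]


def matches(t: Triple, tp: Triple) -> bool:
    """True if data triple t matches triple pattern tp."""
    binding: Dict[str, str] = {}
    for value, term in zip(t, tp):
        if term.startswith("?"):
            if binding.setdefault(term, value) != value:
                return False
        elif term != value:
            return False
    return True


def c_all(t: Triple, ident: str, bqp: Set[Triple]) -> bool:
    return True


def c_match(t: Triple, ident: str, bqp: Set[Triple]) -> bool:
    return any(matches(t, tp) for tp in bqp)


def c_none(t: Triple, ident: str, bqp: Set[Triple]) -> bool:
    return False


def reachable(web: Web, seeds: Set[str], c: Criterion, bqp: Set[Triple]) -> Set[str]:
    """Identifiers of all documents that are (c, B)-reachable from the seeds.

    Terminates only because the example Web is finite.
    """
    found = {i for i in seeds if i in web}  # case 1: documents of seed identifiers
    queue = deque(found)
    while queue:
        doc = queue.popleft()
        for t in web[doc]:
            for term in t:  # case 2: follow a data link if the criterion allows it
                if term in web and term not in found and c(t, term, bqp):
                    found.add(term)
                    queue.append(term)
    return found


web: Web = {
    "a": {("a", "knows", "b")},
    "b": {("b", "name", "Bob"), ("b", "knows", "c")},
    "c": {("c", "name", "Carol")},
}
bqp = {("a", "knows", "?y"), ("?y", "name", "?n")}
```

For seed set `{"a"}`, the reachable documents grow as the criterion becomes less restrictive: only `a` under `c_none`, `a` and `b` under `c_match`, and all three documents under `c_all`.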

B.3 Proof of Theorem 1

We prove the theorem by reducing the halting problem to FinitenessReachablePart and to FinitenessQueryResult. 

The halting problem asks whether a given Turing machine (TM) halts on a given input. For the reduction we assume an infinite Web
of Linked Data $W_{TMs}$ which we define in the following. Informally, $W_{TMs}$ describes all possible computations of all TMs. For a formal
definition of $W_{TMs}$ we adopt the usual approach to unambiguously describe TMs and their input by finite words over the (finite) alphabet
of a universal TM (e.g. [Pap93]). Let $\mathcal{W}$ be the countably infinite set of all words that describe TMs. For each $w \in \mathcal{W}$ let $M(w)$ denote
the machine described by $w$ and let $c^{w,x}$ denote the computation of $M(w)$ on input $x$. Furthermore, let $id_i^{w,x}$ denote an identifier for the
$i$-th step in $c^{w,x}$. To denote the (infinite) set of all such identifiers we write $\mathcal{I}_{TMsteps}$. Using the identifiers in $\mathcal{I}_{TMsteps}$ we may unambiguously
identify each step in each possible computation of any TM on any given input. However, if an identifier $id \in \mathcal{I}$ could potentially identify a
computation step of a TM on some input (because $id$ adheres to the pattern used for such identifiers) but the corresponding step may never
exist, then $id \notin \mathcal{I}_{TMsteps}$. For instance, if the computation of a particular TM $M(w_j)$ on a particular input $x_k$ halts with the $i'$-th step, then
$\forall i \in \{1, \ldots, i'\} : id_i^{w_j,x_k} \in \mathcal{I}_{TMsteps}$ and $\forall i \in \{i'+1, \ldots\} : id_i^{w_j,x_k} \notin \mathcal{I}_{TMsteps}$. Notice, while the set $\mathcal{I}_{TMsteps}$ is infinite, it is still countable
because i) $\mathcal{W}$ is countably infinite, ii) the set of all possible input words for TMs is countably infinite, and iii) $i$ is a natural number.

We now define $W_{TMs}$ as a Web of Linked Data $(D_{TMs}, data_{TMs}, adoc_{TMs})$ with the following elements: $D_{TMs}$ consists of $|\mathcal{I}_{TMsteps}|$
different LD documents, each of which corresponds to one of the identifiers in $\mathcal{I}_{TMsteps}$ (and, thus, to a particular step in a particular
computation of a particular TM). Mapping $adoc_{TMs}$ is bijective and maps each $id_i^{w,x} \in \mathcal{I}_{TMsteps}$ to the corresponding $d_i^{w,x} \in D_{TMs}$;
hence, $dom(adoc_{TMs}) = \mathcal{I}_{TMsteps}$. We emphasize that mapping $adoc_{TMs}$ is (Turing) computable because a universal TM may determine
by simulation whether the computation of a particular TM on a particular input halts before a particular number of steps (i.e. whether the
$i$-th step in computation $c^{w,x}$ for a given identifier $id_i^{w,x}$ may actually exist). Finally, mapping $data_{TMs}$ is computed as follows: The set
$data_{TMs}(d_i^{w,x})$ of data triples for an LD document $d_i^{w,x}$ is empty if and only if $c^{w,x}$ halts with the $i$-th computation step. Otherwise,
$data_{TMs}(d_i^{w,x})$ contains a single data triple $(id_i^{w,x}, \mathsf{next}, id_{i+1}^{w,x})$ which associates the computation step $id_i^{w,x}$ with the next step in $c^{w,x}$ ($\mathsf{next}$
denotes an identifier for this relationship). Formally:

$$data_{TMs}(d_i^{w,x}) = \begin{cases} \emptyset & \text{if } c^{w,x} \text{ halts with the } i\text{-th computation step,} \\ \{(id_i^{w,x}, \mathsf{next}, id_{i+1}^{w,x})\} & \text{else.} \end{cases}$$

We emphasize that mapping $data_{TMs}$ is also (Turing) computable because a universal TM may determine by simulation whether the
computation of a particular TM on a particular input halts after a given number of steps.

Before we come to the reduction we highlight a property of $W_{TMs}$ that is important for our proof. Each data triple $(id_i^{w,x}, \mathsf{next}, id_{i+1}^{w,x})$
establishes a data link from $d_i^{w,x}$ to $d_{i+1}^{w,x}$. Due to such links we recursively may reach all LD documents about all steps in a particular
computation of any TM. Hence, for each possible computation $c^{w,x}$ we have a (potentially infinite) simple path $(d_1^{w,x}, \ldots, d_i^{w,x}, \ldots)$ in the
Web link graph of $W_{TMs}$. Each such path is finite iff the corresponding computation halts. Finally, we note that each of these paths forms a
separate subgraph of the Web link graph of $W_{TMs}$ because we use a separate set of step identifiers and LD documents for each computation.
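
To make the construction more tangible, the following Python sketch computes the data of a single step document by bounded simulation. It is a toy stand-in, not the universal TM used in the proof: a "machine" is assumed to be a Python function that maps an integer configuration to the next configuration (or to `None` if the machine has halted), and the names `step_id`, `data_tms`, `countdown`, and `forever` are hypothetical.

```python
from typing import Callable, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Step = Callable[[int], Optional[int]]  # configuration -> next one, None if halted


def step_id(machine: str, x: int, i: int) -> str:
    """Identifier for the i-th step of the computation of `machine` on input x."""
    return f"{machine}/{x}/{i}"


def data_tms(machine: str, step: Step, x: int, i: int) -> Optional[Set[Triple]]:
    """Data of the document for step i, or None if step i never exists.

    Only i transitions are simulated, so the function always returns,
    even for machines that do not halt.
    """
    config = x  # configuration of step 1
    for _ in range(i - 1):
        nxt = step(config)
        if nxt is None:
            return None  # the computation halted before reaching step i
        config = nxt
    if step(config) is None:
        return set()  # the computation halts with the i-th step
    return {(step_id(machine, x, i), "next", step_id(machine, x, i + 1))}


def countdown(n: int) -> Optional[int]:
    """Toy machine that halts once its configuration reaches 0."""
    return n - 1 if n > 0 else None


def forever(n: int) -> Optional[int]:
    """Toy machine that never halts."""
    return n + 1
```

For `countdown` on input 2 the chain of `next` links ends at the third step document, whereas for `forever` every step document links to a successor, which mirrors the finite and infinite paths described above.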

We now reduce the halting problem to FinitenessReachablePart. The input to the halting problem is a pair $(w, x)$ consisting of a
TM description $w$ and a possible input word $x$. For the reduction we need a computable mapping $f_1$ that, given such a pair $(w, x)$, produces a
tuple $(W, S, c, B)$ as input for FinitenessReachablePart. We define $f_1$ as follows: Let $w$ be the description of a TM, let $x$ be a possible
input word for $M(w)$, and let $?v \in \mathcal{V}$ be an arbitrary query variable, then $f_1(w, x) = (W_{TMs}, S_{w,x}, c_{\mathsf{All}}, B_{w,x})$ where $S_{w,x} = \{id_1^{w,x}\}$
and $B_{w,x} = \{(id_1^{w,x}, \mathsf{next}, ?v)\}$. Given that $c_{\mathsf{All}}$ and $W_{TMs}$ are independent of $(w, x)$, it can be easily seen that $f_1$ is computable by TMs
(including LD machines).

To show that FinitenessReachablePart is not LD machine decidable, suppose it were LD machine decidable. In such a case an LD
machine could answer the halting problem for any input $(w, x)$ as follows: $M(w)$ halts on $x$ if and only if the $(S_{w,x}, c_{\mathsf{All}}, B_{w,x})$-reachable
part of $W_{TMs}$ is finite. However, we know the halting problem is undecidable for TMs (which includes LD machines). Hence, we have a
contradiction and, thus, FinitenessReachablePart cannot be LD machine decidable.

The proof that FinitenessQueryResult is not LD machine decidable is similar to that for FinitenessReachablePart. Hence,
we only outline the idea: Instead of reducing the halting problem to FinitenessReachablePart based on mapping $f_1$, we now reduce
the halting problem to FinitenessQueryResult using a mapping $f_2$ that differs from $f_1$ in the BQP it generates: $f_2(w, x) =
(W_{TMs}, S_{w,x}, c_{\mathsf{All}}, B'_{w,x})$ where $B'_{w,x} = \{(id_1^{w,x}, \mathsf{next}, ?x), (?y, \mathsf{next}, ?z)\}$. Notice, the two triple patterns in $B'_{w,x}$ have no variable in
common. If FinitenessQueryResult were LD machine decidable then an LD machine could answer the halting problem for any $(w, x)$:
$M(w)$ halts on $x$ if and only if $\mathcal{Q}_{c_{\mathsf{All}}}^{B'_{w,x}, S_{w,x}}(W_{TMs})$ is finite.



B.4 Proof of Lemma 1

As a preliminary to prove Lemma 1 we introduce a specific LD machine for CLD queries:



Definition 21. Let $\mathcal{Q}_c^{B,S}$ be a CLD query. The $(B, S, c)$-machine is an LD machine that implements Algorithm 2. This algorithm makes
use of a subroutine called lookup. This subroutine, when called with an identifier $id \in \mathcal{I}$, i) writes $enc(id)$ to the right end of the word on
the link traversal tape, ii) enters the expand state, and iii) performs the expand operation as specified in Definition 4.



Algorithm 2 The program of a $(B, S, c)$-machine.
1: Call lookup for each $id \in S$.

2: for $j = 1, 2, \ldots$ do

3: Let $T_j$ denote the set of all data triples currently encoded on the link traversal tape. Use the work tape to enumerate a set $\Omega_j$ that
contains all valuations $\mu$ for which $dom(\mu) = vars(B)$ and $\mu[B] \subseteq T_j$.

4: For each $\mu \in \Omega_j$ check whether $\mu$ is already encoded on the output tape; if not, then add $enc(\mu)$ to the output.

5: Scan the link traversal tape for a data triple $t$ that contains an identifier $id \in ids(t)$ such that i) $c(t, id, B) = true$ and ii) the word on
the link traversal tape neither contains $enc(id)\, enc(adoc(id))\, \sharp$ nor $enc(id)\, \sharp$. If such $t$ and $id$ exist, call lookup for $id$; otherwise
halt the computation.

6: end for
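
The control flow of Algorithm 2 can be sketched in Python for a finite, in-memory Web. This is an illustration under assumptions that go beyond the definition: tapes are replaced by Python lists and sets, a Web is a dictionary from identifiers to document data, every term of a triple is treated as a candidate identifier (a lookup of a term without a document yields nothing), variables start with `?`, and all function names are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Web = Dict[str, Set[Triple]]
Valuation = Dict[str, str]
Criterion = Callable[[Triple, str, Set[Triple]], bool]


def c_all(t: Triple, ident: str, bqp: Set[Triple]) -> bool:
    return True


def c_none(t: Triple, ident: str, bqp: Set[Triple]) -> bool:
    return False


def solutions(bqp: Set[Triple], triples: Set[Triple]) -> List[Valuation]:
    """All valuations mu with dom(mu) = vars(B) and mu[B] contained in triples."""
    patterns = sorted(bqp)
    found: List[Valuation] = []
    for combo in product(sorted(triples), repeat=len(patterns)):
        mu: Valuation = {}
        ok = True
        for t, tp in zip(combo, patterns):
            for value, term in zip(t, tp):
                if term.startswith("?"):
                    ok = ok and mu.setdefault(term, value) == value
                else:
                    ok = ok and term == value
        if ok:
            found.append(mu)
    return found


def bsc_machine(web: Web, seeds: Set[str], c: Criterion,
                bqp: Set[Triple]) -> List[Valuation]:
    """Sketch of Algorithm 2; halts only if the reachable part is finite."""
    tape: List[Tuple[str, Optional[Set[Triple]]]] = []  # link traversal tape
    looked_up: Set[str] = set()
    output: List[Valuation] = []                        # output tape

    def lookup(ident: str) -> None:
        looked_up.add(ident)
        tape.append((ident, web.get(ident)))            # None: adoc undefined

    for ident in sorted(seeds):                         # line 1
        lookup(ident)
    while True:                                         # line 2
        t_j = {t for _, data in tape if data for t in data}
        for mu in solutions(bqp, t_j):                  # line 3
            if mu not in output:                        # line 4
                output.append(mu)
        candidate = next((term for t in sorted(t_j) for term in t
                          if term not in looked_up and c(t, term, bqp)), None)
        if candidate is None:                           # line 5
            return output
        lookup(candidate)                               # one lookup per iteration


web: Web = {
    "a": {("a", "knows", "b")},
    "b": {("b", "name", "Bob"), ("b", "knows", "c")},
    "c": {("c", "name", "Carol")},
}
bqp = {("?y", "name", "?n")}
```

With seed set `{"a"}`, the sketch reports both name bindings under `c_all` and no solution under `c_none`, because the latter never leaves the seed document. As in the algorithm, at most one identifier is looked up per iteration, so solutions over already discovered data are reported before further links are followed.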



As can be seen in Algorithm 2, the computation of each $(B, S, c)$-machine (with a Web of Linked Data $W$ encoded on its Web tape) starts
with an initialization (cf. line 1). After the initialization, the machine enters a (potentially non-terminating) loop. During each iteration of this
loop, the machine generates valuations using all data that is currently encoded on the link traversal tape. The following proposition shows
that these valuations are part of the corresponding query result (find the proof for Proposition 6 below in Section B.5):

Proposition 6. Let $M^{(B,S,c)}$ be a $(B, S, c)$-machine with a Web of Linked Data $W$ encoded on its Web tape. During the execution of
Algorithm 2 by $M^{(B,S,c)}$ it holds $\forall j \in \{1, 2, \ldots\} : \Omega_j \subseteq \mathcal{Q}_c^{B,S}(W)$.

Proposition 6 presents the basis to prove the soundness of query results computed by Algorithm 2. To verify the completeness of these
results it is important to note that $(B, S, c)$-machines look up no more than one identifier per iteration (cf. line 5 in Algorithm 2). Hence,
$(B, S, c)$-machines prioritize result construction over link traversal. Due to this feature we show that for each solution in a query result there exists
an iteration during which that solution is computed (find the proof for Proposition 7 below in Section B.6):

Proposition 7. Let $M^{(B,S,c)}$ be a $(B, S, c)$-machine with a Web of Linked Data $W$ encoded on its Web tape. For each $\mu \in \mathcal{Q}_c^{B,S}(W)$ exists
a $j_\mu \in \{1, 2, \ldots\}$ such that during the execution of Algorithm 2 by $M^{(B,S,c)}$ it holds $\forall j \in \{j_\mu, j_\mu + 1, \ldots\} : \mu \in \Omega_j$.

So far our results verify that i) the set of query solutions computed after any iteration is sound and ii) that this set is complete after a particular
(potentially infinite) number of iterations. We now show that the computation definitely reaches each iteration after a finite number of
computation steps (find the proof for Proposition 8 below in Section B.7):

Proposition 8. Let $M^{(B,S,c)}$ be a $(B, S, c)$-machine with a Web of Linked Data $W$ encoded on its Web tape. For any possible iteration $it$ of
the main processing loop in Algorithm 2 it requires only a finite number of computation steps before $M^{(B,S,c)}$ starts $it$.

We now prove Lemma 1. Let:

• $W = (D, data, adoc)$ be a potentially infinite Web of Linked Data;

• $\mathcal{Q}_c^{B,S}$ be a CLD query; and

• $W_c^{(S,B)} = (D_\mathfrak{R}, data_\mathfrak{R}, adoc_\mathfrak{R})$ denote the $(S, c, B)$-reachable part of $W$.

If: Let $W_c^{(S,B)}$ be finite. Hence, $\mathcal{Q}_c^{B,S}(W)$ is finite as well (cf. Proposition 1). We have to show that there exists an LD machine that
computes $\mathcal{Q}_c^{B,S}(W)$ and halts after a finite number of computation steps. Based on Propositions 6 to 8 it is easy to verify that the
$(B, S, c)$-machine (with $enc(W)$ on its Web tape) is such a machine: It computes $\mathcal{Q}_c^{B,S}(W)$ and it is guaranteed to halt because $W_c^{(S,B)}$ is
finite.

Only if: W.l.o.g., let $M$ be an LD machine (not necessarily a $(B, S, c)$-machine) that computes $\mathcal{Q}_c^{B,S}(W)$ and halts after a finite number of
computation steps. We have to show that $W_c^{(S,B)}$ is finite. We show this by contradiction, that is, we assume $W_c^{(S,B)}$ is infinite. In this case
$D_\mathfrak{R}$ is infinite. Since $M$ computes $\mathcal{Q}_c^{B,S}(W)$, $M$ must (recursively) expand the word on its link traversal tape until it contains the encodings
of (at least) each LD document in $D_\mathfrak{R}$. Such an expansion is necessary to ensure that the computed query result is complete. Since $D_\mathfrak{R}$ is
infinite the expansion requires infinitely many computing steps. However, we know that $M$ halts after a finite number of computation steps.
Hence, we have a contradiction and, thus, $W_c^{(S,B)}$ must be finite.

B.5 Proof of Proposition 6

Let:

• $W = (D, data, adoc)$ be a Web of Linked Data;

• $M^{(B,S,c)}$ be a $(B, S, c)$-machine (cf. Definition 21) with $enc(W)$ on its Web tape;

• $W_c^{(S,B)}$ denote the $(S, c, B)$-reachable part of $W$.

To prove Proposition 6 we use the following result.

Lemma 3. During the execution of Algorithm 2 by $M^{(B,S,c)}$ on (Web) input $enc(W)$ it holds $\forall j \in \{1, 2, \ldots\} : T_j \subseteq \mathrm{AllData}(W_c^{(S,B)})$.

Proof of Lemma 3. Let $w_j$ be the word on the link traversal tape of $M^{(B,S,c)}$ when the $j$-th iteration of the main processing loop in
Algorithm 2 (i.e. lines 2 to 6) starts.

To prove $\forall j \in \{1, 2, \ldots\} : T_j \subseteq \mathrm{AllData}(W_c^{(S,B)})$ it is sufficient to show that for each $w_j$ (where $j \in \{1, 2, \ldots\}$) there exists a finite sequence
$id_1, \ldots, id_{n_j}$ of $n_j$ different identifiers ($\forall i \in [1, n_j] : id_i \in \mathcal{I}$) such that i) $w_j$ is*

$enc(id_1)\, enc(adoc(id_1))\, \sharp \ldots \sharp\, enc(id_{n_j})\, enc(adoc(id_{n_j}))\, \sharp$

and ii) for each $i \in [1, n_j]$ either $id_i \notin dom(adoc)$ (and, thus, $adoc(id_i)$ is undefined) or $adoc(id_i)$ is an LD document which is
$(c, B)$-reachable from $S$ in $W$. We use an induction over $j$ for this proof.

Base case ($j = 1$): The computation of $M^{(B,S,c)}$ starts with an empty link traversal tape. Due to the initialization, $w_1$ is a concatenation
of sub-words $enc(id)\, enc(adoc(id))\, \sharp$ for all $id \in S$ (cf. line 1 in Algorithm 2). Hence, we have a corresponding sequence $id_1, \ldots, id_{n_1}$
where $n_1 = |S|$ and $\forall i \in [1, n_1] : id_i \in S$. The order of the identifiers in that sequence depends on the order in which they have been
looked up and is irrelevant for our proof. For all $id \in S$ it holds either $id \notin dom(adoc)$ or $adoc(id)$ is $(c, B)$-reachable from $S$ in $W$
(cf. case 1 in Definition 10).

Induction step ($j > 1$): Our inductive hypothesis is that there exists a finite sequence $id_1, \ldots, id_{n_{j-1}}$ of $n_{j-1}$ different identifiers
($\forall i \in [1, n_{j-1}] : id_i \in \mathcal{I}$) such that i) $w_{j-1}$ is

$enc(id_1)\, enc(adoc(id_1))\, \sharp \ldots \sharp\, enc(id_{n_{j-1}})\, enc(adoc(id_{n_{j-1}}))\, \sharp$

and ii) for each $i \in [1, n_{j-1}]$ either $id_i \notin dom(adoc)$ or $adoc(id_i)$ is $(c, B)$-reachable from $S$ in $W$. In the $(j-1)$-th iteration the
$(B, S, c)$-machine finds a data triple $t$ encoded as part of $w_{j-1}$ such that $\exists\, id \in ids(t) : c(t, id, B) = true$ and lookup has not been called for $id$. The
machine calls lookup for $id$, which changes the word on the link traversal tape to $w_j$. Hence, $w_j$ is equal to $w_{j-1}\, enc(id)\, enc(adoc(id))\, \sharp$
and, thus, our sequence of identifiers for $w_j$ is $id_1, \ldots, id_{n_{j-1}}, id$. It remains to show that if $id \in dom(adoc)$ then $adoc(id)$ is
$(c, B)$-reachable from $S$ in $W$.

Assume $id \in dom(adoc)$. Since data triple $t$ is encoded as part of $w_{j-1}$ we know, from our inductive hypothesis, that $t$ must be contained
in the data of an LD document $d^*$ that is $(c, B)$-reachable from $S$ in $W$ (and for which there exists $i \in [1, n_{j-1}]$ such that $adoc(id_i) = d^*$).
Therefore, $t$ and $id$ satisfy the requirements as given in case 2 of Definition 10 and, thus, $adoc(id)$ is $(c, B)$-reachable from $S$ in $W$. □

Proposition 6 is an immediate consequence of Lemma 3.

B.6 Proof of Proposition 7

Let:

• $W = (D, data, adoc)$ be a Web of Linked Data;

• $M^{(B,S,c)}$ be a $(B, S, c)$-machine (cf. Definition 21) with $enc(W)$ on its Web tape;

• $W_c^{(S,B)}$ denote the $(S, c, B)$-reachable part of $W$.

To prove Proposition 7 we use the following result.

Lemma 4. For each data triple $t \in \mathrm{AllData}(W_c^{(S,B)})$ exists a $j_t \in \{1, 2, \ldots\}$ such that during the execution of Algorithm 2 by $M^{(B,S,c)}$ it
holds $\forall j \in \{j_t, j_t + 1, \ldots\} : t \in T_j$.

Proof of Lemma 4. Let $w_j$ be the word on the link traversal tape of $M^{(B,S,c)}$ when $M^{(B,S,c)}$ starts the $j$-th iteration of the main processing
loop in Algorithm 2 (i.e. lines 2 to 6).

W.l.o.g., let $t'$ be an arbitrary data triple $t' \in \mathrm{AllData}(W_c^{(S,B)})$. There must exist an LD document $d \in D$ such that i) $t' \in data(d)$
and ii) $d$ is $(c, B)$-reachable from $S$ in $W$. Let $d'$ be such a document. Since $M^{(B,S,c)}$ only appends to the link traversal tape we prove that
there exists a $j_{t'} \in \{1, 2, \ldots\}$ with $\forall j \in \{j_{t'}, j_{t'} + 1, \ldots\} : t' \in T_j$ by showing that there exists $j_{t'} \in \{1, 2, \ldots\}$ such that $w_{j_{t'}}$ contains the
sub-word $enc(d')$.

Since $d'$ is $(c, B)$-reachable from $S$ in $W$, the Web link graph for $W$ contains at least one finite path $(d_0, \ldots, d_n)$ of LD documents $d_i$
where $n \in \{0, 1, \ldots\}$, i) $\exists\, id \in S : adoc(id) = d_0$, ii) $d_n = d'$, and iii) for each $i \in \{1, \ldots, n\}$ it holds:

$\exists\, t \in data(d_{i-1}) : \big( \exists\, id \in ids(t) : ( adoc(id) = d_i \text{ and } c(t, id, B) = true ) \big)$ (3)

Let $(d_0^*, \ldots, d_n^*)$ be such a path. We use this path for our proof. More precisely, we show by induction over $i \in \{0, \ldots, n\}$ that there exists
$j_t \in \{1, 2, \ldots\}$ such that $w_{j_t}$ contains the sub-word $enc(d_n^*)$ (which is the same as $enc(d')$ because $d_n^* = d'$).

Base case ($i = 0$): Since $\exists\, id \in S : adoc(id) = d_0^*$ it is easy to verify that $w_1$ contains the sub-word $enc(d_0^*)$.

Induction step ($i > 0$): Our inductive hypothesis is: There exists $j \in \{1, 2, \ldots\}$ such that $w_j$ contains sub-word $enc(d_{i-1}^*)$. Based on this
hypothesis we show that there exists a $j' \in \{j, j + 1, \ldots\}$ such that $w_{j'}$ contains the sub-word $enc(d_i^*)$. We distinguish two cases: either
$enc(d_i^*)$ is already contained in $w_j$ or it is not contained in $w_j$. In the first case we have $j' = j$; in the latter case we have $j' > j$. We have
to discuss the latter case only.

Due to (3) there exist $t^* \in data(d_{i-1}^*)$ and $id^* \in ids(t^*)$ such that $adoc(id^*) = d_i^*$ and $c(t^*, id^*, B) = true$. Hence, there exists a $\delta \in \mathbb{N}^0$
such that $M^{(B,S,c)}$ finds $t^*$ and $id^*$ in the $(j+\delta)$-th iteration (cf. line 5 in Algorithm 2). Since $M^{(B,S,c)}$ calls lookup for $id^*$ in that iteration,
it holds that $w_{j+\delta+1}$ contains $enc(d_i^*)$ and, thus, $j' = j + \delta + 1$. □

Proposition 7 is an immediate consequence of Lemma 4.

*We assume $enc(adoc(id_i))$ is the empty word if $adoc(id_i)$ is undefined (i.e. $id_i \notin dom(adoc)$).

B.7 Proof of Proposition 8

Let:

• $W$ be a Web of Linked Data;

• $M^{(B,S,c)}$ be a $(B, S, c)$-machine (cf. Definition 21) with $enc(W)$ on its Web tape.

To show that it requires only a finite number of computation steps before $M^{(B,S,c)}$ starts any possible iteration of the main processing loop
in Algorithm 2 we first emphasize that 1.) each call of the subroutine lookup terminates because the encoding of $W$ is ordered following
the order of the identifiers used in $W$ and that 2.) the initialization in line 1 of Algorithm 2 finishes after a finite number of computation steps
because $S$ is finite.

Hence, it remains to show that each iteration of the loop also finishes after a finite number of computation steps: Let $w$ denote the word on
the link traversal tape at any point in the computation. $w$ is always finite because $M^{(B,S,c)}$ only gradually appends (encoded) LD documents
to the link traversal tape (one document per iteration) and the encoding of each document is finite (recall the set of data triples $data(d)$ for
each LD document $d$ is finite). Due to the finiteness of $w$, each $\Omega_j$ (for $j = 1, 2, \ldots$) is finite, resulting in a finite number of computation
steps for lines 3 and 4 during any iteration. The scan in line 5 also finishes after a finite number of computation steps because $w$ is finite.

B.8 Proof of Corollary 1

Corollary 1 immediately follows from Lemma 1 and Fact 1 (for CLD queries that use an empty set $S$ of seed identifiers) as well as from
Lemma 1 and Proposition 1, case 1 (for CLD queries under $c_{\mathsf{None}}$-semantics).

B.9 Proof of Theorem 2

To prove the theorem we only have to show that all CLD queries are at least eventually computable. Corollary 1 shows that some of them are
even finitely computable.

To show that all CLD queries (using any possible reachability criterion) are at least eventually computable we use the notion of a
$(B, S, c)$-machine (cf. Definition 21 in Section B.4) and show that all computations of $(B, S, c)$-machines have the two properties as prescribed in
Definition 14.

W.l.o.g., let $M^{(B,S,c)}$ be an arbitrary $(B, S, c)$-machine with an arbitrary Web of Linked Data $W$ encoded on its Web tape; let $W_c^{(S,B)}$
be the $(S, c, B)$-reachable part of $W$. During the computation, $M^{(B,S,c)}$ only writes to its output tape when it adds (encoded) valuations
$\mu \in \Omega_j$ (for $j = 1, 2, \ldots$). Since all these valuations are solutions for $\mathcal{Q}_c^{B,S}$ in $W$ (cf. Proposition 6 in Section B.4) and line 4 in Algorithm 2
ensures that the output is free of duplicates, we see that the word on the output tape is always a prefix of a possible encoding of $\mathcal{Q}_c^{B,S}(W)$.
Hence, the computation of $M^{(B,S,c)}$ has the first property specified in Definition 14. Property 2 readily follows from Propositions 7 and 8
(cf. Section B.4).

B.10 Proof of Theorem 3

We prove the theorem by reducing FinitenessReachablePart to ComputabilityCLD. For the reduction we use an identity
function $f_3$ that, for any Web of Linked Data $W$, set $S \subset \mathcal{I}$ of seed identifiers, reachability criterion $c$, and BQP $B$, is defined as follows:
$f_3(W, S, c, B) = (W, S, c, B)$. Obviously, $f_3$ is computable by TMs (including LD machines).

To obtain a contradiction, we assume that ComputabilityCLD is LD machine decidable. If that were the case an LD machine could
immediately use Lemma 1 to answer FinitenessReachablePart for any (potentially infinite) Web of Linked Data $W$ and CLD query
$\mathcal{Q}_c^{B,S}$ where $S$ is nonempty and $c$ is less restrictive than $c_{\mathsf{None}}$. Since we know FinitenessReachablePart is not LD machine decidable
(cf. Theorem 1) we have a contradiction.

B.11 Proof of Proposition 2

Let:

• $W$ be a Web of Linked Data;

• $\mathcal{Q}_c^{B,S}$ be a CLD query;

• $W_c^{(S,B)}$ denote the $(S, c, B)$-reachable part of $W$;

• $W_\mathfrak{D}$ be a discovered part of $W$ and an induced subweb of $W_c^{(S,B)}$;

• $\sigma = (P, \mu)$ be a partial solution for $\mathcal{Q}_c^{B,S}$ in $W$; and

• $\sigma' = (P', \mu')$ be a $(t, tp)$-augmentation of $\sigma$ in $W_\mathfrak{D}$.

To show that a' is a partial solution for in W, we have to show: (1) P' <Z B and (2) fi' is a solution for CLD query in W 

(cf. Definition[T6). 



(1) holds because i) a = {P, fi) is a partial solution for Qf in M^and, thus, P C B, and ii) P' = P U {tp} with tp e B \ P 
(cf. DefinitionflTt. 

To show (2) we note that dom(/i') = vars(P') (cf. Definition [17) ■ It remains to show ii'[P'] C AllData(wi^'^^) (cf. Definition [T2]l. 
Due to Definition [TT] we have fJ.'[P'] = fJ.[P] U {t} with t £ AllData(VKD). It holds t G AllData(Wi^'^') because Ws is an induced 
subweb of W'i'^'^' and, therefore, A]lDa.ta{W^) C AllData(Tyi^'-^>). Furthermore, ii[P] C AllData(H/i^'^') because {P, fi) is a 
partial solution for Qf in W^and, thus, is a solution for ^ in W. Therefore, ii'[P'] C AllData(VKi'^'^'). 

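B.12 Proof of Proposition 3

Let:

• $W = (D, data, adoc)$ be a Web of Linked Data;

• $W_\mathfrak{D} = (D_\mathfrak{D}, data_\mathfrak{D}, adoc_\mathfrak{D})$ be a discovered part of $W$;

• $\mu$ be a valuation; and

• $W'_\mathfrak{D} = (D'_\mathfrak{D}, data'_\mathfrak{D}, adoc'_\mathfrak{D})$ be the $\mu$-expansion of $W_\mathfrak{D}$.

To show that $W_\mathfrak{D}$ is an induced subweb of $W'_\mathfrak{D}$ we have to show that $W_\mathfrak{D}$ satisfies the three requirements in Definition 3 w.r.t. $W'_\mathfrak{D}$.

For requirement 1 we have to show $D_\mathfrak{D} \subseteq D'_\mathfrak{D}$, which holds because $D'_\mathfrak{D} = D_\mathfrak{D} \cup \Delta^W(\mu)$ (cf. Definition 18).

For requirement 2 we have to show:

$\forall d \in D_\mathfrak{D} : data'_\mathfrak{D}(d) = data_\mathfrak{D}(d)$ (4)

Since $W'_\mathfrak{D}$ is an induced subweb of $W$ (cf. Definition 18) it holds:

$\forall d \in D'_\mathfrak{D} : data'_\mathfrak{D}(d) = data(d)$

and with $D_\mathfrak{D} \subseteq D'_\mathfrak{D}$ (which we have shown before):

$\forall d \in D_\mathfrak{D} : data'_\mathfrak{D}(d) = data(d)$

$W_\mathfrak{D}$ is also an induced subweb of $W$ (cf. Definition 18). Hence:

$\forall d \in D_\mathfrak{D} : data_\mathfrak{D}(d) = data(d)$

and, thus, (4) holds.

For requirement 3 we have to show:

$\forall id \in \{id \in \mathcal{I} \mid adoc'_\mathfrak{D}(id) \in D_\mathfrak{D}\} : adoc_\mathfrak{D}(id) = adoc'_\mathfrak{D}(id)$

Since $W_\mathfrak{D}$ is an induced subweb of $W$ (cf. Definition 15) it holds:

$\forall id \in \{id \in \mathcal{I} \mid adoc(id) \in D_\mathfrak{D}\} : adoc_\mathfrak{D}(id) = adoc(id)$ (5)

Furthermore, $W'_\mathfrak{D}$ is an induced subweb of $W$ (cf. Definition 18). Hence:

$\forall id \in \{id \in \mathcal{I} \mid adoc(id) \in D'_\mathfrak{D}\} : adoc'_\mathfrak{D}(id) = adoc(id)$

Since $D_\mathfrak{D} \subseteq D'_\mathfrak{D}$ (which we have shown before) we rewrite (5) by using $adoc'_\mathfrak{D}$ instead of $adoc$:

$\forall id \in \{id \in \mathcal{I} \mid adoc'_\mathfrak{D}(id) \in D_\mathfrak{D}\} : adoc_\mathfrak{D}(id) = adoc'_\mathfrak{D}(id)$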

B.13 Proof of Proposition 4

Let:

• $W = (D, data, adoc)$ be a Web of Linked Data;

• $W_\mathfrak{D} = (D_\mathfrak{D}, data_\mathfrak{D}, adoc_\mathfrak{D})$ be a discovered part of $W$;

• $\mu$ be a valuation; and

• $W'_\mathfrak{D} = (D'_\mathfrak{D}, data'_\mathfrak{D}, adoc'_\mathfrak{D})$ be the $\mu$-expansion of $W_\mathfrak{D}$.

To show that $W'_\mathfrak{D}$ is a discovered part of $W$ we have to show that $W'_\mathfrak{D}$ is finite (cf. Definition 15), which holds iff $D'_\mathfrak{D}$ is finite.

We have $D'_\mathfrak{D} = D_\mathfrak{D} \cup \Delta^W(\mu)$ (cf. Definition 18). $D_\mathfrak{D}$ is finite because $W_\mathfrak{D}$ is a discovered part of $W$. $\Delta^W(\mu)$ is also finite because it
contains at most as many elements as we have variables in $dom(\mu)$, which is always a finite number.
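
The counting argument can be checked on a small example. The Python sketch below is an illustration only; it assumes that a Web and a discovered part are dictionaries from identifiers to document data, that a document is identified by its identifier, and that the expansion simply adds the document of every identifier bound by the valuation. The function name `expand` is hypothetical.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]
Web = Dict[str, Set[Triple]]


def expand(web: Web, discovered: Web, mu: Dict[str, str]) -> Web:
    """Sketch of the mu-expansion of a discovered part of `web`.

    delta collects the identifiers bound by mu that have a document in `web`;
    it has at most len(mu) elements, one per variable in dom(mu).
    """
    delta = {value for value in mu.values() if value in web}
    expanded = dict(discovered)  # everything discovered so far is kept as is
    for ident in delta:
        expanded[ident] = web[ident]
    return expanded


web: Web = {
    "a": {("a", "knows", "b")},
    "b": {("b", "name", "Bob")},
    "c": {("c", "name", "Carol")},
}
discovered: Web = {"a": web["a"]}
mu = {"?y": "b", "?n": "Bob"}  # "Bob" is a constant without a document
expanded = expand(web, discovered, mu)
```

Here the expansion adds exactly one document (the one for `b`), so the expanded part remains finite and still contains the previously discovered part unchanged.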



B.14 Proof of Proposition 5

Let: 

• W — {D, data, adoc) be a Web of Linked Data; 

• Q^f^i, be a CLD query (under CMatch -semantics); 

• ^CMa'tfh^ ~ (^Wi data^i, adocy{) denote the (5, CMatch, -B)-reachable part of W; 

• Wz> = (-Ds) , data's , adoC's ) be a discovered part of and an induced subweb of Wi^,,^^ ; 

• a — (P, fi) be a partial solution for Q^^^f^,, in W; and 

• = (Dj) , data'^s , adoc'^ ) be the /i-expansion of . 

To show that is an induced subweb of we have show that W^^ satisfies the three requirements in Definition [3] with respect to 

For requirement[T]we have to show D'^, C D<r. Due to Definition[T8]we have D'j, = _Dd U A^(/i). It also holds Dd C because Wd 
is an induced subweb of Wc^^^^ . Hence, it remains to show A'''^ (n) C D^. We show A'^(/i) C D^j by contradiction, that is, we assume 

3d G A'-'-'(^) : d ^ A-H. 

According to the definition of A^{ii) must exist v' G dom(/i) such that G T and adoc(^fi{v')) — d (cf. Definition! 1 St. 

Since cr = (-P, /i) is a partial solution for Q^f^,, in W, we know that /x is a solution for Qc^ftch ^^(cf- Definition 1 16t and, thus, 
fi[P] C AllData(Wi|^3,^h') ('-f- Definition II 2t. Together with v' G dom(/i) (see above) we have 3tp' G P : v' £ vars(tp') and 
3 G AllData(wif^^f|,^) : fj,[tp'] = t'. Since n{v') G I and u' G vars(tp') it must hold ii{v') G ids(t'). 

Because of t' G AllData(VKi||J^,^|,') we also have 3d' G -Dm : t' G data{d'). Notice, d' is (cMatch, B)-reachable (from 5* in W). 
Furthermore, it must hold CMatch {t' , I^W),B) — true because t' matches tp' £ P (- B. 

Putting everything together, we have d' £ D,t' £ data{d'), and j-i{v') £ ids(f'), and we know that i) d' is (cMatch, -B)-reachable from 5* 
in W, ii) CMatch (i', ^) = true, and iii) adoc(^jj,{v')^ = d. Thus, d must be (cMatch, -B)-reachable from S in W(cf. Definition|18ll; i.e. 

d G D<n. This contradicts our assumption d ^ Dm. 

We omit showing that Wxi satisfies requirement |2] and requirement|3]w.r.t. Wc^^^^j; the proof ideas are the same as those that we use in 
the proof of Proposition[3](cf. |B.12t . 

B. 15 Proof of Theorem U 

As a basis for proving the soundness we use the following lemma, which may be verified based on Propositions 2, 4, and 5 (find the proof for the following lemma below in Section B.16).

Lemma 5. Let $W$ be a Web of Linked Data and let $Q^{B,S}_{c_{Match}}$ be a CLD query. During an (arbitrary) execution of $\mathrm{ltbExec}(S, B, W)$ it always holds: i) each $\sigma \in \mathfrak{P}$ is a partial solution for $Q^{B,S}_{c_{Match}}$ in $W$ and ii) $\mathfrak{D}$ is a discovered part of $W$ and an induced subweb of $W^{(S,B)}_{c_{Match}}$.

Analogous to Lemma 5, the following lemma provides the basis for our proof of completeness (find the proof for the following lemma below in Sections B.17 and B.18).

Lemma 6. Let $W = (D, \mathrm{data}, \mathrm{adoc})$ be a Web of Linked Data and let $Q^{B,S}_{c_{Match}}$ be a CLD query. i) For each $d \in D$ that is $(c_{Match}, B)$-reachable from $S$ in $W$ there will eventually be an iteration in any execution of $\mathrm{ltbExec}(S, B, W)$ after which $d$ is part of $\mathfrak{D}$. ii) For each partial solution $\sigma$ that may exist for $Q^{B,S}_{c_{Match}}$ in $W$ there will eventually be an iteration in any execution of $\mathrm{ltbExec}(S, B, W)$ after which $\sigma \in \mathfrak{P}$.

We now use Lemmas 5 and 6 to prove Theorem 4. Let:

• $W$ be a Web of Linked Data;

• $Q^{B,S}_{c_{Match}}$ be a CLD query (under $c_{Match}$-semantics);

• $W^{(S,B)}_{c_{Match}}$ denote the $(S, c_{Match}, B)$-reachable part of $W$;

• $\mathfrak{P}$ be the set of partial solutions (for $Q^{B,S}_{c_{Match}}$ in $W$) that is used in $\mathrm{ltbExec}(S, B, W)$; and

• $\mathfrak{D}$ be the discovered part of $W$ that is used in $\mathrm{ltbExec}(S, B, W)$.

Soundness: W.l.o.g., let $\mu^*$ be a valuation that an arbitrary execution of $\mathrm{ltbExec}(S, B, W)$ reports in some iteration $it_j$. We have to show $\mu^* \in Q^{B,S}_{c_{Match}}(W)$. $\mu^*$ originates from the pair $(P^*, \mu^*)$ that the execution of $\mathrm{ltbExec}(S, B, W)$ constructs and adds to $\mathfrak{P}$ in iteration $it_j$. Since $(P^*, \mu^*)$ is a partial solution for $Q^{B,S}_{c_{Match}}$ in $W$ (cf. Lemma 5) and $\mathrm{ltbExec}$ reports $\mu^*$ only if $P^* = B$ (cf. line 8 in Algorithm 1), it holds that $\mu^*$ is a solution for $Q^{B,S}_{c_{Match}}$ in $W$ (cf. Definition 16); i.e. $\mu^* \in Q^{B,S}_{c_{Match}}(W)$.

Completeness: W.l.o.g., let $\mu^*$ be an arbitrary solution for $Q^{B,S}_{c_{Match}}$ in $W$; i.e. $\mu^* \in Q^{B,S}_{c_{Match}}(W)$. We have to show that any execution of $\mathrm{ltbExec}(S, B, W)$ will eventually report $\mu^*$. For $\mu^*$ there exists a partial solution $\sigma^* = (P^*, \mu^*)$ (for $Q^{B,S}_{c_{Match}}$ in $W$) such that $P^* = B$. Due to Lemma 6 we know that during any execution of $\mathrm{ltbExec}(S, B, W)$ there will be an iteration in which this partial solution $\sigma^*$ is constructed and added to $\mathfrak{P}$. This iteration will report $\mu^*$ because $P^* = B$ (cf. line 8 in Algorithm 1).
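
The execution loop that this proof reasons about can be illustrated with a minimal Python sketch. It is not Algorithm 1 itself: it assumes a toy, finite Web given as a dictionary `web` from URIs to lists of triples (so a URI stands for both an identifier and its document), variables written as strings starting with `?`, and hypothetical helper names (`bind`, `match`, `open_augmentations`, `ltb_exec`). It only mirrors the steps referenced above: initialize the set of partial solutions and the discovered part, repeatedly perform one open AE task, expand the discovered part, and report a valuation once the partial solution covers all of $B$.

```python
def is_var(term):
    return term.startswith("?")

def bind(tp, mu):
    """Apply valuation mu to triple pattern tp (written mu[tp] in the text)."""
    return tuple(mu.get(term, term) for term in tp)

def match(t, tp):
    """Return the variable bindings under which triple t matches tp, else None."""
    binding = {}
    for p, v in zip(tp, t):
        if is_var(p):
            if binding.setdefault(p, v) != v:
                return None
        elif p != v:
            return None
    return binding

def open_augmentations(partial, discovered, bqp, web):
    """Yield the partial solutions that open AE tasks (sigma, t, tp) would add."""
    for patterns, mu_items in partial:
        mu = dict(mu_items)
        for tp in bqp:
            if tp in patterns:
                continue
            for uri in discovered:
                for t in web[uri]:
                    extra = match(t, bind(tp, mu))
                    if extra is not None:
                        sigma = (patterns | {tp},
                                 frozenset({**mu, **extra}.items()))
                        if sigma not in partial:
                            yield sigma

def ltb_exec(seeds, bqp, web):
    discovered = {u for u in seeds if u in web}   # initial discovered part
    partial = {(frozenset(), frozenset())}        # the empty partial solution
    reported = []
    while True:
        # Pick any open AE task; the toy loop stops when none is left.
        sigma = next(open_augmentations(partial, discovered, bqp, web), None)
        if sigma is None:
            return reported, discovered
        patterns, mu_items = sigma
        partial.add(sigma)                                    # augmentation
        discovered |= {v for _, v in mu_items if v in web}    # expansion
        if patterns == frozenset(bqp):                        # covers all of B
            reported.append(dict(mu_items))

web = {
    "ex:a": [("ex:a", "ex:knows", "ex:b")],
    "ex:b": [("ex:b", "ex:name", "Bob")],
    "ex:c": [("ex:c", "ex:name", "Carl")],   # never reached from the seed
}
bqp = [("ex:a", "ex:knows", "?x"), ("?x", "ex:name", "?n")]
solutions, discovered = ltb_exec({"ex:a"}, bqp, web)
print(solutions, sorted(discovered))
```

On a finite toy Web the loop stops once no open AE task remains; on an infinite Web no such guarantee exists, which is why the proof argues about what happens "eventually".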



B.16 Proof of Lemma 5

Let $W$ be a Web of Linked Data and let $Q^{B,S}_{c_{Match}}$ be a CLD query (under $c_{Match}$-semantics). We show Lemma 5 by induction over the iterations of the main processing loop (lines 3 to 9 in Algorithm 1) in $\mathrm{ltbExec}(S, B, W)$.

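Base case (i = 0): Before the first iteration, $\mathrm{ltbExec}(S, B, W)$ initializes $\mathfrak{P}$ as a set containing a single element: $\sigma_0 = (P_0, \mu_0)$ where $P_0 = \emptyset$ (cf. line 1 in Algorithm 1). $\sigma_0$ is a partial solution for $Q^{B,S}_{c_{Match}}$ in $W$ because it holds:

• $P_0 \subseteq B$,

• $\mathrm{dom}(\mu_0) = \emptyset = \mathrm{vars}(P_0)$, and

• $\mu_0[P_0] = \emptyset \subseteq \mathrm{AllData}(W^{(S,B)}_{c_{Match}})$.

$\mathfrak{D}$ is initialized with $\mathfrak{D}^{S,W}_{init} = (D_0, \mathrm{data}_0, \mathrm{adoc}_0)$ (cf. line 2 in Algorithm 1). Recall the definition of $D_0$ (cf. Section 6.1):

$$D_0 = \{ \mathrm{adoc}(id) \mid id \in S \text{ and } id \in \mathrm{dom}(\mathrm{adoc}) \}$$

Hence, $D_0$ contains at most $|S|$ LD documents. Therefore, $\mathfrak{D}^{S,W}_{init}$ is finite and, thus, a discovered part of $W$. $\mathfrak{D}^{S,W}_{init}$ is also an induced subweb of $W^{(S,B)}_{c_{Match}}$ because each $d \in D_0$ satisfies case 1 in Definition 10.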
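
The bound $|D_0| \le |S|$ used in the base case can be checked on a small example. The following Python sketch assumes that `adoc` is given as a dictionary from identifiers to (names of) LD documents; the identifiers and document names are hypothetical.

```python
def initial_documents(seeds, adoc):
    """Return D_0 = { adoc(id) | id in S and id in dom(adoc) }."""
    return {adoc[i] for i in seeds if i in adoc}

# Hypothetical toy mapping: two identifiers resolve to the same LD document,
# and one seed identifier is not in dom(adoc) at all.
adoc = {"ex:alice": "doc1", "ex:alice#me": "doc1", "ex:bob": "doc2"}
seeds = {"ex:alice", "ex:alice#me", "ex:unknown"}

d0 = initial_documents(seeds, adoc)
print(d0)
assert len(d0) <= len(seeds)   # |D_0| <= |S|, so D_0 is finite
```

Because every element of $D_0$ is the image of at least one seed identifier, the set can never be larger than $S$; it can be smaller when several seeds map to the same document or when a seed is not in $\mathrm{dom}(\mathrm{adoc})$.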

Induction step (i > 0): Our inductive hypothesis is that after the (i-1)-th iteration it holds i) each $\sigma \in \mathfrak{P}$ is a partial solution for $Q^{B,S}_{c_{Match}}$ in $W$ and ii) $\mathfrak{D}$ is a discovered part of $W$ and an induced subweb of $W^{(S,B)}_{c_{Match}}$. We show that these two assumptions still hold after the i-th iteration. Let $(\sigma, t, tp)$ be the open AE task selected in the i-th iteration (cf. line 4 in Algorithm 1).

$\mathrm{ltbExec}(S, B, W)$ extends $\mathfrak{P}$ by adding $(P', \mu')$ (cf. line 6), the $(t, tp)$-augmentation of $\sigma$ in $\mathfrak{D}$. According to Proposition 2, $(P', \mu')$ is a partial solution for $Q^{B,S}_{c_{Match}}$ in $W$ because $\mathfrak{D}$ is an induced subweb of $W^{(S,B)}_{c_{Match}}$ (inductive hypothesis) and $\sigma$ is a partial solution for $Q^{B,S}_{c_{Match}}$ in $W$ (cf. Definition 19).

Furthermore, the result of the $\mu'$-expansion $\exp^{\mu'}(\mathfrak{D})$ of $\mathfrak{D}$ becomes the new $\mathfrak{D}$ (cf. line 7). According to Proposition 4, $\exp^{\mu'}(\mathfrak{D})$ is again a discovered part of $W$; and, according to Proposition 5, it is also an induced subweb of $W^{(S,B)}_{c_{Match}}$.

B.17 Proof of Assertion i) in Lemma 6

Let $W = (D, \mathrm{data}, \mathrm{adoc})$ be a Web of Linked Data and let $Q^{B,S}_{c_{Match}}$ be a CLD query (under $c_{Match}$-semantics). At any point in the execution of $\mathrm{ltbExec}(S, B, W)$ let $D_{\mathfrak{D}}$ denote the set of LD documents in the currently discovered part $\mathfrak{D}$ of $W$.

W.l.o.g., let $d^*$ be an arbitrary LD document that is $(c_{Match}, B)$-reachable from $S$ in $W$. We have to show that during any possible execution of $\mathrm{ltbExec}(S, B, W)$ there will eventually be an iteration after which $d^* \in D_{\mathfrak{D}}$. In correspondence to Definition 10 we distinguish two cases: 1.) $\exists id \in S : \mathrm{adoc}(id) = d^*$ and 2.) $\neg \exists id \in S : \mathrm{adoc}(id) = d^*$.

Case 1.) Before the first iteration, any execution of $\mathrm{ltbExec}(S, B, W)$ initializes $\mathfrak{D}$ with $\mathfrak{D}^{S,W}_{init} = (D_0, \mathrm{data}_0, \mathrm{adoc}_0)$ (cf. line 2 in Algorithm 1). Recall the definition of $D_0$ (cf. Section 6.1):

$$D_0 = \{ \mathrm{adoc}(id) \mid id \in S \text{ and } id \in \mathrm{dom}(\mathrm{adoc}) \}$$

Since $\exists id \in S : \mathrm{adoc}(id) = d^*$ it holds $d^* \in D_0$. Due to the initialization $D_{\mathfrak{D}} = D_0$ we have $d^* \in D_{\mathfrak{D}}$ before the first iteration.

Case 2.) If $\neg \exists id \in S : \mathrm{adoc}(id) = d^*$, it must hold that the Web link graph for $W$ contains at least one finite path $(d_0, ... , d_n)$ of $(c_{Match}, B)$-reachable LD documents $d_i$ where i) $\exists id \in S : \mathrm{adoc}(id) = d_0$, ii) $d_n = d^*$, and iii) for each $i \in \{1, ... , n\}$ it holds:

$$\exists t \in \mathrm{data}(d_{i-1}) : \Big( \exists id \in \mathrm{ids}(t) : \big( \mathrm{adoc}(id) = d_i \wedge c_{Match}(t, id, B) = \mathrm{true} \big) \Big) \qquad (6)$$

Let $(d^*_0, ... , d^*_n)$ be such a path. In the following, we show by induction over $i \in \{0, ... , n\}$ that there will eventually be an iteration (during any possible execution of $\mathrm{ltbExec}(S, B, W)$) after which $D_{\mathfrak{D}}$ contains $d^*_n = d^*$.

Base case (i = 0): We have already shown for case 1.) that $d^*_0 \in D_{\mathfrak{D}}$ before the first iteration in any possible execution of $\mathrm{ltbExec}(S, B, W)$.

Induction step (i > 0): W.l.o.g., for the following discussion we assume a particular execution of $\mathrm{ltbExec}(S, B, W)$. Our inductive hypothesis is that during this execution there will eventually be an iteration $it_j$ after which $d^*_{i-1} \in D_{\mathfrak{D}}$. Based on this hypothesis we show that there will be an iteration $it_{j+\delta}$ after which $d^*_i \in D_{\mathfrak{D}}$. We distinguish two cases: either after iteration $it_j$ it already holds $d^*_i \in D_{\mathfrak{D}}$ or it still holds $d^*_i \notin D_{\mathfrak{D}}$. We have to discuss the latter case only.

Due to (6) there exist $t^* \in \mathrm{data}(d^*_{i-1})$ and $id^* \in \mathrm{ids}(t^*)$ such that $\mathrm{adoc}(id^*) = d^*_i$ and $c_{Match}(t^*, id^*, B) = \mathrm{true}$. Hence, there must be at least one triple pattern $tp \in B$ such that $t^*$ matches $tp$. Let $tp^* \in B$ be such a triple pattern. Since $t^*$ matches $tp^*$, there exists a partial solution $\sigma^* = (\{tp^*\}, \mu^*)$ with $\mu^*[tp^*] = t^*$ and $\exists\, ?v \in \mathrm{dom}(\mu^*) : \mu^*(?v) = id^*$. After iteration $it_j$ this $\sigma^*$ has either been constructed (and added to $\mathfrak{P}$) or there exists an open AE task $(\sigma_0, t^*, tp^*)$ which will eventually be executed in some iteration $it_{j+\delta}$, resulting in the construction of $\sigma^*$. Let $it_{j'}$ be the iteration in which $\sigma^*$ has been or will be constructed. In this iteration $\mathrm{ltbExec}(S, B, W)$ expands $\mathfrak{D}$ to $\exp^{\mu^*}(\mathfrak{D})$. This expansion results in adding each $d \in \Delta^{W}(\mu^*)$ to $D_{\mathfrak{D}}$ (cf. Definition 18). Since $\exists\, ?v \in \mathrm{dom}(\mu^*) : \mu^*(?v) = id^*$ and $\mathrm{adoc}(id^*) = d^*_i$ it holds $d^*_i \in \Delta^{W}(\mu^*)$. Hence, $d^*_i$ will be added to $D_{\mathfrak{D}}$ in iteration $it_{j'}$ (if it has not been added before).
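
The path-based characterization of reachability in (6) can be made concrete with a breadth-first traversal. The Python sketch below rests on several simplifying assumptions that are not part of the formal model: the Web is a finite dictionary `web` from URIs to lists of triples, each URI identifies exactly one document, $\mathrm{ids}(t)$ is approximated by the terms of $t$ that are keys of `web`, and $c_{Match}(t, id, B)$ is taken to be true exactly when $t$ matches some triple pattern in $B$ (as used in the proof above). All function names are hypothetical.

```python
from collections import deque

def is_var(term):
    return term.startswith("?")

def matches(t, tp):
    """True if data triple t matches triple pattern tp."""
    binding = {}
    for p, v in zip(tp, t):
        if is_var(p):
            if binding.setdefault(p, v) != v:
                return False
        elif p != v:
            return False
    return True

def c_match(t, bqp):
    """Assumed reading of c_Match: t matches at least one pattern in B."""
    return any(matches(t, tp) for tp in bqp)

def reachable(seeds, bqp, web):
    """Documents that are (c_Match, B)-reachable from the seeds in the toy Web."""
    reached = {u for u in seeds if u in web}   # case 1: looked up from a seed
    queue = deque(sorted(reached))
    while queue:                               # case 2: follow qualifying links
        doc = queue.popleft()
        for t in web[doc]:
            if not c_match(t, bqp):
                continue
            for ident in t:
                if ident in web and ident not in reached:
                    reached.add(ident)
                    queue.append(ident)
    return reached

web = {
    "ex:a": [("ex:a", "ex:knows", "ex:b"), ("ex:a", "ex:likes", "ex:d")],
    "ex:b": [("ex:b", "ex:name", "Bob")],
    "ex:c": [("ex:c", "ex:name", "Carl")],
    "ex:d": [("ex:d", "ex:name", "Dan")],
}
bqp = [("ex:a", "ex:knows", "?x"), ("?x", "ex:name", "?n")]
print(sorted(reachable({"ex:a"}, bqp, web)))
```

In this example `ex:d` is linked from `ex:a`, but only through a triple that matches no pattern in the query, so the link does not qualify and `ex:d` is not reachable.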

B.18 Proof of Assertion ii) in Lemma 6

Let $W = (D, \mathrm{data}, \mathrm{adoc})$ be a Web of Linked Data and let $Q^{B,S}_{c_{Match}}$ be a CLD query (under $c_{Match}$-semantics). At any point in the execution of $\mathrm{ltbExec}(S, B, W)$ let $D_{\mathfrak{D}}$ denote the set of LD documents in the currently discovered part $\mathfrak{D}$ of $W$.

W.l.o.g., let $\sigma^* = (P^*, \mu^*)$ be an arbitrary partial solution for $Q^{B,S}_{c_{Match}}$ in $W$. The construction of $\sigma^*$ comprises the iterative construction of a finite sequence $(\sigma_0 = (P_0, \mu_0), ... , \sigma_n = (P_n, \mu_n))$ of partial solutions where i) $\sigma_0$ is the empty partial solution (cf. Section 6.2), ii) $\sigma_n = \sigma^*$, and iii) for each $i \in \{1, ... , n\}$ it holds

$$\exists tp \in B \setminus P_{i-1} : P_i = P_{i-1} \cup \{tp\} \text{ and } \mu_{i-1}[P_{i-1}] = \mu_i[P_{i-1}]$$

We show by induction over $i \in \{0, ... , n\}$ that there will eventually be an iteration (during any possible execution of $\mathrm{ltbExec}(S, B, W)$) after which $\mathfrak{P}$ contains $\sigma_n = \sigma^*$.

Base case (i = 0): Any execution of $\mathrm{ltbExec}(S, B, W)$ adds $\sigma_0$ to $\mathfrak{P}$ before it starts the first iteration (cf. line 1 in Algorithm 1).

Induction step (i > 0): W.l.o.g., for the following discussion we assume a particular execution of $\mathrm{ltbExec}(S, B, W)$. Our inductive hypothesis is that there will eventually be an iteration $it_j$ after which $\sigma_{i-1} \in \mathfrak{P}$. Based on this hypothesis we show that there will be an iteration $it_{j+\delta}$ after which $\sigma_i \in \mathfrak{P}$. We distinguish two cases: either after iteration $it_j$ it already holds $\sigma_i \in \mathfrak{P}$ or it still holds $\sigma_i \notin \mathfrak{P}$. We have to discuss the latter case only.

Let $tp^* \in B$ be the triple pattern for which $P_i = P_{i-1} \cup \{tp^*\}$. Since $\sigma_i = (P_i, \mu_i)$ is a partial solution for $Q^{B,S}_{c_{Match}}$ in $W$, it holds that $\mu_i$ is a solution for the corresponding CLD query for $P_i$ in $W$ and, thus, there exists a $(c_{Match}, B)$-reachable LD document $d^* \in D$ such that $\mu_i[tp^*] = t^* \in \mathrm{data}(d^*)$. According to Assertion i) in Lemma 6 there will eventually be an iteration after which $d^* \in D_{\mathfrak{D}}$. By then, $\sigma_i$ has either already been constructed and added to $\mathfrak{P}$ or there exists an open AE task $(\sigma_{i-1}, t^*, tp^*)$. In the latter case, this task will eventually be executed, resulting in the construction and addition of $\sigma_i$.

B.19 Proof of Lemma 2

Let $S \subset \mathcal{I}$ be a finite set of seed identifiers and let $B = \{tp_1, ... , tp_n\}$ be a BQP such that $Q^{B,S}_{c_{Match}}$ is a CLD query (under $c_{Match}$-semantics). Furthermore, let $W = (D, \mathrm{data}, \mathrm{adoc})$ be the Web of Linked Data over which $Q^{B,S}_{c_{Match}}$ has to be executed using the iterator based implementation of link traversal based query execution that we introduce in [7].

As a preliminary for proving Lemma 2 we introduce the iterator based implementation approach using the concepts and the formalism that are part of our query execution model.

For the iterator based execution of $Q^{B,S}_{c_{Match}}$ over $W$ we assume an order for the triple patterns in $B$; w.l.o.g. let this order be denoted by the indices of the symbols that denote the triple patterns; i.e. $tp_i \in B$ precedes $tp_{i+1} \in B$ for all $i \in \{1, ... , n-1\}$. Accordingly, we write $P_k$ to denote the subset of $B$ that contains the first $k$ triple patterns in the ordered $B$, that is, for all $k \in \{1, ... , n\}$ it holds $P_k = \{ tp_i \in B \mid 1 \le i \le k \}$.

Furthermore, let $I_0, I_1, ... , I_n$ be the chain of iterators used for the iterator based execution of $Q^{B,S}_{c_{Match}}$ over $W$. Iterator $I_0$ is a special iterator that provides a single, empty partial solution $\sigma_0$ (cf. Section 6.2). For all $k \in \{1, ... , n\}$ iterator $I_k$ is responsible for triple pattern $tp_k$ from the ordered BQP. We shall see that each $I_k$ provides partial solutions $(P, \mu)$ (for $Q^{B,S}_{c_{Match}}$ in $W$) for which $P = P_k$, that is, the valuation $\mu$ of each such partial solution is a solution for the corresponding CLD query for $P_k$ (in $W$). As a consequence, for each partial solution $(P_n, \mu)$ provided by the last iterator $I_n$, valuation $\mu$ can be reported as a solution for $Q^{B,S}_{c_{Match}}$ in $W$.

During query execution all iterators access and change the (currently) discovered part $\mathfrak{D}$ of the queried Web of Linked Data $W$. Before the execution, $\mathfrak{D}$ is initialized as $\mathfrak{D}^{S,W}_{init}$ (cf. Section 6.1). This initialization may be performed in the Open function of the aforementioned special iterator $I_0$.

Algorithm 3 presents the GetNext function¹ implemented by each iterator $I_k$ (for all $k \in \{1, ... , n\}$). In order to compute partial solutions iterator $I_k$ first consumes a partial solution $\sigma_{pred} = (P_{k-1}, \mu_{pred})$ from its predecessor $I_{k-1}$ (cf. line 2 in Algorithm 3). Lines 7 and 8 may be understood as a (combined) performance of multiple (open) AE tasks: For each data triple $t^*$ that i) is contained in the data of the LD documents discovered so far (i.e. $t^* \in \mathrm{AllData}(\mathfrak{D})$) and that ii) matches triple pattern $tp'_k = \mu_{pred}[tp_k]$, iterator $I_k$ adds a partial solution $\sigma_{t^*}$ to $M_k$; each $\sigma_{t^*}$ is the $(t^*, tp_k)$-augmentation of $\sigma_{pred}$ in $\mathfrak{D}$ (cf. line 7). Due to the construction of $tp'_k$ from $tp_k$ (cf. line 6), any data triple $t^*$ that matches $tp'_k$ also matches $tp_k$ and, thus, each $\sigma_{t^*}$ is also the $(t^*, tp'_k)$-augmentation of $\sigma_{pred}$ (in $\mathfrak{D}$). After populating $M_k$, iterator $I_k$ uses all $\sigma_{t^*} \in M_k$ to expand the currently discovered part of $W$ incrementally (cf. line 8). Hence, lines 7 and 8 may be understood as a (combined) performance of all those (open) AE tasks $(\sigma, t, tp)$ for which $\sigma = \sigma_{pred}$, $tp = tp_k$, and $t = t^*$ is a data triple that has the aforementioned properties. Due to the finiteness of $\mathfrak{D}$ (cf. Definition 15 and Proposition 4) there is only a finite number of such data triples $t^*$ and, thus, the number of AE tasks iterator $I_k$ performs for $\sigma_{pred}$ is also finite. As a consequence, to prove Lemma 2 it suffices to show that each iterator $I_k$ only consumes a finite number of partial solutions from its predecessor $I_{k-1}$. Hence, it suffices to prove the following lemma.

Lemma 7. The overall number of partial solutions provided by each iterator via its GetNext function is finite.
Proof of Lemma 7. We prove the lemma by induction over the chain of iterators $I_0, I_1, ... , I_n$.
Base case ($I_0$): The special iterator provides a single partial solution $\sigma_0$.

Induction step ($I_k$ for $k \in \{1, ... , n\}$): Our inductive hypothesis is that iterator $I_{k-1}$ provides a finite number of partial solutions via its GetNext function. Based on this hypothesis we show that iterator $I_k$ provides a finite number of partial solutions via its GetNext function. Due to our inductive hypothesis it is sufficient to show that for each partial solution which $I_k$ consumes from $I_{k-1}$, $I_k$ provides a finite number of partial solutions. Let $\sigma_{pred} = (P_{pred}, \mu_{pred})$ be such a partial solution that $I_k$ consumes from $I_{k-1}$ (line 2 in Algorithm 3). $I_k$ applies $\mu_{pred}$ to its triple pattern $tp_k$ (line 6) and uses the resulting triple pattern $tp'_k = \mu_{pred}[tp_k]$ to generate set $M_k$ (line 7). Hence, this set contains exactly those partial solutions that $I_k$ provides based on $\sigma_{pred}$ (lines 10 to 12). However, $M_k$ is finite because $I_k$ generates $M_k$ on a particular snapshot of the discovered part $\mathfrak{D}$ of $W$ and $\mathfrak{D}$ is finite at any point during query execution. □



¹ Of the three versions of the iterator based implementation approach that we introduce in [7], Algorithm 3 corresponds to the first, most naive version. That is, Algorithm 3 neither applies the idea of URI prefetching nor the idea of non-blocking iterators [7].



Algorithm 3 GetNext function for iterator $I_k$ in our iterator based implementation of link traversal based query execution [7].

Require:

- a triple pattern $tp_k$;

- a predecessor iterator $I_{k-1}$;

- the currently discovered part $\mathfrak{D}$ of the queried Web of Linked Data $W$ (note, all iterators have access to $\mathfrak{D}$);

- an initially empty set $M_k$ that allows the iterator to keep (precomputed) partial solutions between calls of this GetNext function

1: while $M_k = \emptyset$ do

2:    $\sigma_{pred}$ := $I_{k-1}$.GetNext // consume partial solution from direct predecessor

3:    if $\sigma_{pred}$ = endOfFile then
4:       return endOfFile

5:    end if

6:    $tp'_k$ := $\mu_{pred}[tp_k]$ // $\mu_{pred}$ is the valuation in $\sigma_{pred} = (P_{pred}, \mu_{pred})$

7:    $M_k$ := $\{ \mathrm{aug}_{t^*, tp_k}(\sigma_{pred}) \mid t^* \text{ matches } tp'_k \text{ and } t^* \in \mathrm{AllData}(\mathfrak{D}) \}$ // construct partial solutions

8:    for all $\sigma' = (P', \mu') \in M_k$ do $\mathfrak{D}$ := $\exp^{\mu'}(\mathfrak{D})$ end for // expand $\mathfrak{D}$ using all newly constructed partial solutions

9: end while

10: $\sigma'$ := an element in $M_k$
11: $M_k$ := $M_k \setminus \{\sigma'\}$
12: return $\sigma'$
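
The control flow of Algorithm 3 can be mirrored in a short Python sketch. It is an illustration under assumptions, not the implementation from [7]: the Web is a finite dictionary `web` from URIs to lists of triples, `None` stands in for endOfFile, partial solutions are pairs of a frozenset of patterns and a dictionary, and the class and function names (`Root`, `TripleIterator`, `run`) are hypothetical. The comments refer to the line numbers of Algorithm 3.

```python
def is_var(term):
    return term.startswith("?")

def bind(tp, mu):
    """Apply valuation mu to triple pattern tp."""
    return tuple(mu.get(term, term) for term in tp)

def match(t, tp):
    """Return the variable bindings under which triple t matches tp, else None."""
    binding = {}
    for p, v in zip(tp, t):
        if is_var(p):
            if binding.setdefault(p, v) != v:
                return None
        elif p != v:
            return None
    return binding

class Root:
    """Plays the role of I_0: provides the single empty partial solution."""
    def __init__(self):
        self.pending = [(frozenset(), {})]

    def get_next(self):
        return self.pending.pop() if self.pending else None

class TripleIterator:
    """Plays the role of I_k for one triple pattern tp_k."""
    def __init__(self, tp, pred, web, discovered):
        self.tp, self.pred = tp, pred
        self.web, self.discovered = web, discovered  # discovered set is shared
        self.m = []                                  # the set M_k

    def get_next(self):
        while not self.m:                                     # line 1
            sigma_pred = self.pred.get_next()                 # line 2
            if sigma_pred is None:                            # lines 3-5
                return None
            patterns, mu_pred = sigma_pred
            tp_bound = bind(self.tp, mu_pred)                 # line 6
            # Snapshot of all data triples discovered so far.
            snapshot = {t for uri in self.discovered for t in self.web[uri]}
            for t in snapshot:                                # line 7
                extra = match(t, tp_bound)
                if extra is not None:
                    self.m.append((patterns | {self.tp}, {**mu_pred, **extra}))
            for _, mu in self.m:                              # line 8
                self.discovered.update(v for v in mu.values() if v in self.web)
        return self.m.pop()                                   # lines 10-12

def run(seeds, ordered_bqp, web):
    discovered = {u for u in seeds if u in web}
    iterator = Root()
    for tp in ordered_bqp:
        iterator = TripleIterator(tp, iterator, web, discovered)
    reported = []
    sigma = iterator.get_next()
    while sigma is not None:
        reported.append(sigma[1])
        sigma = iterator.get_next()
    return reported

web = {
    "ex:a": [("ex:a", "ex:knows", "ex:b")],
    "ex:b": [("ex:b", "ex:name", "Bob")],
}
tp1 = ("ex:a", "ex:knows", "?x")
tp2 = ("?x", "ex:name", "?n")
print(run({"ex:a"}, [tp1, tp2], web))
print(run({"ex:a"}, [tp2, tp1], web))
```

In this toy setting the order of the triple patterns matters: with `tp2` first, the first iterator sees only the data of the seed document, finds no match, and the chain ends without reporting anything. Each call works on a finite snapshot, so the chain always terminates here.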



B.20 Proof of Theorem E 

The guarantee for termination is a direct consequence of Lemma 2. The whole chain of iterators performs a finite number of AE tasks only. The performance of each AE task terminates because all operations in Algorithm 3 are synchronized and are guaranteed to terminate.

It remains to show that the set of valuations reported by any iterator based execution is always a finite subset of the corresponding query result: Let $S \subset \mathcal{I}$ be a finite set of seed identifiers and let $B = \{tp_1, ... , tp_n\}$ be a BQP such that $Q^{B,S}_{c_{Match}}$ is a CLD query (under $c_{Match}$-semantics). Furthermore, let $W = (D, \mathrm{data}, \mathrm{adoc})$ be the Web of Linked Data over which $Q^{B,S}_{c_{Match}}$ has to be executed.

For the iterator based execution of $Q^{B,S}_{c_{Match}}$ over $W$ we assume an order for the triple patterns in $B$; w.l.o.g. let this order be denoted by the indices of the symbols that denote the triple patterns; i.e. $tp_i \in B$ precedes $tp_{i+1} \in B$ for all $i \in \{1, ... , n-1\}$. Furthermore, let $I_0, I_1, ... , I_n$ be the chain of iterators as introduced in the proof for Lemma 2 (cf. Section B.19).

For our proof we use the following lemma.

Lemma 8. For any partial solution $\sigma = (P, \mu)$ provided by the GetNext function of iterator $I_n$ it holds $P = P_n$.

Proof of Lemma 8. We prove the lemma by induction over the chain of iterators $I_0, I_1, ... , I_n$.
Base case ($I_0$): The special iterator provides a single partial solution $\sigma_0 = (P_0, \mu_0)$ which covers the empty part $P_0 = \emptyset$ of $B$.

Induction step ($I_k$ for $k \in \{1, ... , n\}$): Our inductive hypothesis is that for any partial solution $\sigma = (P, \mu)$ provided by the GetNext function of iterator $I_{k-1}$ it holds $P = P_{k-1}$. Based on this hypothesis we show that for any partial solution $\sigma' = (P', \mu')$ provided by the GetNext function of iterator $I_k$ it holds $P' = P_k$. However, this is easily checked in Algorithm 3. As we discuss in Section B.19, each partial solution $\sigma' = (P', \mu')$ added to (any $\sigma_{pred}$-specific version of) $M_k$ (cf. line 7) and returned later (cf. line 12) is a $(t^*, tp_k)$-augmentation of some partial solution $\sigma_{pred} = (P_{pred}, \mu_{pred})$ consumed from $I_{k-1}$. According to our inductive hypothesis $P_{pred} = P_{k-1}$. Therefore, $P' = P_{k-1} \cup \{tp_k\} = P_k$ (cf. Definition 19). □

Lemma 8 shows that each partial solution $(P_n, \mu)$ computed by the last iterator of the chain of iterators covers the whole BQP of the executed CLD query $Q^{B,S}_{c_{Match}}$ (recall, $B = P_n$). Hence, each valuation $\mu$ that the iterator based execution reports from such a partial solution $(P_n, \mu)$ is a solution for $Q^{B,S}_{c_{Match}}$ over $W$.

It remains to show that the iterator based execution may always only report a finite number of such solutions. This result, however, is a direct consequence of Lemma 7 (cf. Section B.19).



